CN117853610A - Modifying the pose of a two-dimensional human in a two-dimensional image

Info

Publication number: CN117853610A
Application number: CN202311286078.XA
Authority: CN (China)
Prior art keywords: dimensional, scene, editing system, image editing, based image
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: G·戈里, 周易, 王杨抟风, 周洋, K·K·辛格, 尹在新, D·C·阿西特
Current assignee: Adobe Inc (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Adobe Systems Inc
Application filed by: Adobe Systems Inc
Priority claimed from: US 18/304,147 (published as US 2024/0144623 A1)
Publication of: CN117853610A

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

Embodiments of the present disclosure relate to modifying a pose of a two-dimensional human in a two-dimensional image. The present disclosure relates to systems, methods, and non-transitory computer-readable media for modifying a two-dimensional image via scene-based editing using a three-dimensional representation of the two-dimensional image. For example, in one or more embodiments, the disclosed system utilizes a three-dimensional representation of a two-dimensional image to generate and modify shadows in the two-dimensional image from various shadow maps. Furthermore, the disclosed system utilizes a three-dimensional representation of a two-dimensional image to modify a human in the two-dimensional image. The disclosed system also utilizes a three-dimensional representation of a two-dimensional image to provide a scene scale estimate via a scale field of the two-dimensional image. In some embodiments, the disclosed system utilizes a three-dimensional representation of a two-dimensional image to generate and visualize a 3D planar surface for modifying objects in the two-dimensional image. The disclosed system also uses a three-dimensional representation of the two-dimensional image to customize the focus of the two-dimensional image.

Description

Modifying the pose of a two-dimensional human in a two-dimensional image
Cross Reference to Related Applications
The present application claims priority to U.S. patent application Ser. No. 18/304,147, filed in 2023, which claims the benefit of and priority to U.S. patent application Ser. No. 18/190,500 and U.S. patent application Ser. No. 18/190,513, each filed on March 27, 2023, which each claim the benefit of and priority to U.S. provisional patent application Ser. No. 63/378,616, filed on October 6, 2022, and are continuation-in-part applications of U.S. patent application Ser. No. 18/058,538, U.S. patent application Ser. No. 18/058,554, U.S. patent application Ser. No. 18/058,575, U.S. patent application Ser. No. 18/058,601, U.S. patent application Ser. No. 18/058,622, and U.S. patent application Ser. No. 18/058,630, each filed on November 23, 2022. The present application is also a continuation-in-part of U.S. patent application Ser. No. 18/190,544, filed on March 27, 2023, and U.S. patent application Ser. No. 18/190,556, filed on March 27, 2023, each of which claims the benefit of and priority to U.S. provisional patent application Ser. No. 63/378,616, filed on October 6, 2022, and is a continuation-in-part of U.S. patent application Ser. No. 18/058,538, U.S. patent application Ser. No. 18/058,554, and U.S. patent application Ser. No. 18/058,601, each filed on November 23, 2022. The present application is also a continuation-in-part of U.S. patent application Ser. No. 18/190,636, filed on March 27, 2023, and U.S. patent application Ser. No. 18/190,654, filed on March 27, 2023, each of which claims the benefit of and priority to U.S. provisional patent application Ser. No. 63/378,616, filed on October 6, 2022. Each of the above applications is incorporated herein by reference in its entirety.
Background
In recent years, there has been significant advancement in hardware and software platforms for performing computer vision and image editing tasks. Indeed, systems provide various image-related tasks, such as object identification, classification, segmentation, composition, style transfer, image restoration, and the like.
Disclosure of Invention
One or more embodiments described herein provide benefits and/or solve one or more problems in the art with systems, methods, and non-transitory computer-readable media that implement artificial intelligence models to facilitate flexible and efficient scene-based image editing. To illustrate, in one or more embodiments, the system utilizes one or more machine learning models to learn/identify characteristics of digital images, predict potential edits to digital images, and/or generate supplemental components that may be used in various edits. Thus, the system obtains an understanding of the two-dimensional image as if it were a real scene, with different semantic regions reflecting real world (e.g., three-dimensional) conditions. Furthermore, the system enables two-dimensional images to be edited such that changes automatically and consistently reflect corresponding real world conditions without relying on additional user input. The system also provides realistic editing of two-dimensional objects in the two-dimensional image based on three-dimensional characteristics of the two-dimensional scene, such as by generating one or more three-dimensional grids based on the two-dimensional image. Thus, the system facilitates flexible and intuitive editing of digital images while effectively reducing user interaction typically required to make such editing.
Additional features and advantages of one or more embodiments of the disclosure are summarized in the description that follows, and in part will be apparent from the description, or may be learned by practice of such example embodiments.
Drawings
The present disclosure will describe one or more embodiments of the invention with additional specificity and detail through use of the accompanying drawings. The following paragraphs briefly introduce the drawings, in which:
FIG. 1 illustrates an example environment in which a scene-based image editing system operates in accordance with one or more embodiments;
FIG. 2 illustrates an overview of a scene-based image editing system editing a digital image into a real scene in accordance with one or more embodiments;
FIG. 3 illustrates a segmentation neural network used by a scene-based image editing system to generate object masks for objects in accordance with one or more embodiments;
FIG. 4 illustrates the use of a cascaded modulation inpainting neural network to generate an inpainted digital image in accordance with one or more embodiments;
FIG. 5 illustrates an example architecture of a cascaded modulation inpainting neural network in accordance with one or more embodiments;
FIG. 6 illustrates a global modulation block and a spatial modulation block implemented in a cascaded modulation inpainting neural network in accordance with one or more embodiments;
FIG. 7 shows a diagram for generating object masks and content fills to facilitate object-aware modification of a digital image in accordance with one or more embodiments;
FIGS. 8A-8D illustrate a graphical user interface implemented by a scene-based image editing system to facilitate move operations in accordance with one or more embodiments;
FIGS. 9A-9C illustrate a graphical user interface implemented by a scene-based image editing system to facilitate delete operations in accordance with one or more embodiments;
FIG. 10 illustrates an image analysis graph used by a scene-based image editing system in generating a semantic scene graph in accordance with one or more embodiments;
FIG. 11 illustrates a real-world category description graph used by a scene-based image editing system in generating a semantic scene graph in accordance with one or more embodiments;
FIG. 12 illustrates a behavior policy graph used by a scene-based image editing system in generating a semantic scene graph in accordance with one or more embodiments;
FIG. 13 illustrates a semantic scene graph generated for a digital image by a scene-based image editing system in accordance with one or more embodiments;
FIG. 14 shows an illustration of generating a semantic scene graph for a digital image using template graphs in accordance with one or more embodiments;
FIG. 15 shows another illustration of generating a semantic scene graph for a digital image in accordance with one or more embodiments;
FIG. 16 illustrates an overview of a multi-attribute contrast classification neural network in accordance with one or more embodiments;
FIG. 17 illustrates an architecture of a multi-attribute contrast classification neural network in accordance with one or more embodiments;
FIG. 18 illustrates an attribute-modifying neural network used by a scene-based image editing system to modify object attributes in accordance with one or more embodiments;
FIGS. 19A-19C illustrate a graphical user interface implemented by a scene-based image editing system to facilitate modifying object properties of objects depicted in a digital image in accordance with one or more embodiments;
FIGS. 20A-20C illustrate another graphical user interface implemented by a scene-based image editing system to facilitate modifying object properties of objects depicted in a digital image in accordance with one or more embodiments;
FIGS. 21A-21C illustrate yet another graphical user interface implemented by a scene-based image editing system to facilitate modifying object properties of objects depicted in a digital image in accordance with one or more embodiments;
FIGS. 22A-22D illustrate a graphical user interface implemented by a scene-based image editing system to facilitate relationship-aware object modification in accordance with one or more embodiments;
FIGS. 23A-23C illustrate another graphical user interface implemented by a scene-based image editing system to facilitate relationship-aware object modification in accordance with one or more embodiments;
FIGS. 24A-24C illustrate yet another graphical user interface implemented by a scene-based image editing system to facilitate relationship-aware object modification in accordance with one or more embodiments;
FIGS. 25A-25D illustrate a graphical user interface implemented by a scene-based image editing system to add objects to a selection for modification based on a classification relationship in accordance with one or more embodiments;
FIG. 26 illustrates a neural network pipeline used by a scene-based image editing system to identify and remove interfering objects from digital images in accordance with one or more embodiments;
FIG. 27 illustrates an architecture of an interferent detection neural network used by a scene-based image editing system to identify and classify interfering objects in a digital image in accordance with one or more embodiments;
FIG. 28 illustrates an architecture of a heatmap network used as part of an interferent detection neural network by a scene-based image editing system in accordance with one or more embodiments;
FIG. 29 illustrates an architecture of a hybrid classifier used as part of an interferent detection neural network by a scene-based image editing system in accordance with one or more embodiments;
FIGS. 30A-30C illustrate a graphical user interface implemented by a scene-based image editing system to identify and remove interfering objects from digital images in accordance with one or more embodiments;
FIGS. 31A-31C illustrate another graphical user interface implemented by a scene-based image editing system to identify and remove interfering objects from digital images in accordance with one or more embodiments;
FIGS. 32A-32B illustrate a scene-based image editing system utilizing intelligent dilation to remove objects from a digital image in accordance with one or more embodiments;
FIG. 33 illustrates an overview of a shadow detection neural network in accordance with one or more embodiments;
FIG. 34 illustrates an overview of an example segmentation component of a shadow detection neural network in accordance with one or more embodiments;
FIG. 35 illustrates an overview of an object perception component of a shadow detection neural network in accordance with one or more embodiments;
FIG. 36 illustrates an overview of a shadow prediction component of a shadow detection neural network in accordance with one or more embodiments;
FIG. 37 illustrates an overview of an architecture of a shadow detection neural network in accordance with one or more embodiments;
FIG. 38 illustrates a diagram for determining shadows associated with an object depicted in a digital image using a second stage of a shadow detection neural network, in accordance with one or more embodiments;
FIGS. 39A-39C illustrate a graphical user interface implemented by a scene-based image editing system to identify and remove shadows of objects depicted in a digital image in accordance with one or more embodiments;
FIG. 40 illustrates an overview of a scene-based image editing system modifying a two-dimensional image by placing two-dimensional objects according to three-dimensional characteristics of a scene of the two-dimensional image, in accordance with one or more embodiments;
FIG. 41 shows a diagram of a scene-based image editing system using multiple models to generate shadows for placing or moving objects in a two-dimensional image in accordance with one or more embodiments;
FIG. 42 shows an illustration of a scene-based image editing system generating shadow maps for a plurality of different types of objects in a two-dimensional image in accordance with one or more embodiments;
FIG. 43 shows a diagram of a scene-based image editing system determining illumination characteristics of a two-dimensional image in accordance with one or more embodiments;
FIG. 44 illustrates a diagram of a scene-based image editing system utilizing a rendering model to generate a modified two-dimensional image based on objects placed in the two-dimensional image in accordance with one or more embodiments;
FIG. 45 shows a diagram of a scene-based image editing system generating a segmented three-dimensional grid for a two-dimensional image in accordance with one or more embodiments;
FIGS. 46A-46C illustrate a graphical user interface for moving objects in a two-dimensional image with realistic shadow generation in accordance with one or more embodiments;
FIGS. 47A-47B illustrate example three-dimensional meshes corresponding to two-dimensional images in connection with generating a proxy three-dimensional mesh for an object in accordance with one or more embodiments;
FIGS. 48A-48B illustrate a graphical user interface for inserting an object from a first two-dimensional image into a second two-dimensional image in accordance with one or more embodiments;
FIG. 49 illustrates a graphical user interface for performing three-dimensional editing of an object inserted into a two-dimensional image in accordance with one or more embodiments;
FIG. 50 illustrates an overview of a scale field generated from a two-dimensional image using a scale field model in connection with performing one or more downstream operations by a scene-based image editing system in accordance with one or more embodiments;
FIG. 51 shows a diagram of a scene-based image editing system utilizing a machine learning model to generate a scale field and ground-to-horizon vectors from a two-dimensional image in accordance with one or more embodiments;
FIG. 52 illustrates a graphical representation of three-dimensional characteristics of a two-dimensional image projected into a three-dimensional space in accordance with one or more embodiments;
FIGS. 53A-53C illustrate a relationship diagram between two-dimensional characteristics of a two-dimensional image and three-dimensional characteristics of a two-dimensional image projected into a three-dimensional space in accordance with one or more embodiments;
FIG. 54 shows a diagram of a scene-based image editing system modifying parameters of one or more machine learning models based on a scale field and a horizon vector in accordance with one or more embodiments;
FIGS. 55A-55D illustrate image coverage in a process for determining a scale field of a two-dimensional image in accordance with one or more embodiments;
FIGS. 56A-56E illustrate a scale field generated for a cropped portion of a panoramic two-dimensional image in accordance with one or more embodiments;
FIG. 57 illustrates a two-dimensional image including an overlay indicating a metric distance in accordance with one or more embodiments;
FIG. 58 illustrates a plurality of two-dimensional images including an insertion object using different methods in accordance with one or more embodiments;
FIG. 59 illustrates an overview of a scene-based image editing system re-posing a two-dimensional human being extracted from a two-dimensional image via a three-dimensional representation of the two-dimensional human being in accordance with one or more embodiments;
FIG. 60 illustrates a diagram of a scene-based image editing system generating a three-dimensional human model representing a two-dimensional human extracted from a two-dimensional image in accordance with one or more embodiments;
FIGS. 61A-61D illustrate diagrams of a scene-based image editing system generating and refining a three-dimensional human model based on two-dimensional humans extracted from two-dimensional images, in accordance with one or more embodiments;
FIG. 62 shows a diagram of a scene-based image editing system generating a modified two-dimensional image in response to a two-dimensional human being re-gestured via a three-dimensional representation in accordance with one or more embodiments;
FIGS. 63A-63G illustrate a graphical user interface for modifying a two-dimensional image via a three-dimensional human model representing a two-dimensional human in the two-dimensional image in accordance with one or more embodiments;
FIG. 64 illustrates a digital image related to re-posing a two-dimensional human in a two-dimensional image in accordance with one or more embodiments;
FIG. 65 illustrates a digital image related to modifying a two-dimensional object interacting with a two-dimensional human in a two-dimensional image via a three-dimensional human model in accordance with one or more embodiments;
FIG. 66 illustrates an overview of a planar surface generated by a scene-based image editing system for display in connection with modifying objects in a two-dimensional image in accordance with one or more embodiments;
FIG. 67 illustrates a diagram of a scene-based image editing system determining a three-dimensional position of a portion of a selected object within a three-dimensional representation of a scene in a two-dimensional image for generating a planar surface in accordance with one or more embodiments;
FIGS. 68A-68E illustrate a graphical user interface for displaying a planar surface in connection with modifying a position of an object within a two-dimensional image in accordance with one or more embodiments;
FIGS. 69A-69B illustrate a graphical user interface for displaying different types of planar surfaces for modifying a selected object within a two-dimensional image in accordance with one or more embodiments;
FIGS. 70A-70C illustrate a graphical user interface for displaying and moving a planar surface in connection with modifying a selected object within a two-dimensional image in accordance with one or more embodiments;
FIG. 71 illustrates a graphical user interface for generating and displaying a three-dimensional bounding box that includes a planar surface for modifying a selected object within a two-dimensional image in accordance with one or more embodiments;
FIG. 72 illustrates a graphical user interface for modifying visual characteristics of a selected object intersecting additional objects in a scene in accordance with one or more embodiments;
FIG. 73 illustrates an overview of a scene-based image editing system modifying blur of a two-dimensional image based on the position of an input element in accordance with one or more embodiments;
FIG. 74A illustrates a diagram of a scene-based image editing system utilizing a three-dimensional representation with input elements to modify focus for applying blur to portions of a two-dimensional image in accordance with one or more embodiments;
FIG. 74B shows a diagram of a scene-based image editing system modifying camera parameters for modifying blur of a two-dimensional image via a three-dimensional renderer, in accordance with one or more embodiments;
FIG. 74C illustrates a diagram of a scene-based image editing system modifying, via a two-dimensional renderer, blur of a two-dimensional image according to depth values in a depth map, in accordance with one or more embodiments;
FIGS. 75A-75E illustrate a graphical user interface for modifying a focus of a two-dimensional image via an input element in accordance with one or more embodiments;
FIGS. 76A-76B illustrate a graphical user interface for modifying a focus of a two-dimensional image via various input elements in accordance with one or more embodiments;
FIGS. 77A-77C illustrate a graphical user interface for selecting portions of a two-dimensional image according to depth values in accordance with one or more embodiments;
FIGS. 78A-78C illustrate a graphical user interface for selecting an object of a two-dimensional image according to depth values in accordance with one or more embodiments;
FIG. 79 illustrates an overview of a depth displacement system that modifies a two-dimensional image based on a three-dimensional grid representing the two-dimensional image in accordance with one or more implementations;
FIG. 80 illustrates an overview of a depth displacement system generating a displacement three-dimensional grid representing a two-dimensional image in accordance with one or more implementations;
FIG. 81 illustrates a diagram of a depth displacement system determining density values of a two-dimensional image in accordance with one or more implementations;
FIG. 82 illustrates a diagram of a depth displacement system determining sample points of a two-dimensional image in one or more sampling iterations, in accordance with one or more implementations;
FIG. 83 illustrates a diagram of a depth displacement system generating a displacement three-dimensional grid based on sampling points of a two-dimensional image, in accordance with one or more implementations;
FIGS. 84A-84B illustrate a plurality of different displacement three-dimensional grids for a two-dimensional image in accordance with one or more implementations;
FIG. 85 illustrates a depth displacement system that generates a displaced three-dimensional grid for a two-dimensional image based on additional input in accordance with one or more implementations;
FIG. 86 illustrates an example schematic diagram of a scene-based image editing system in accordance with one or more embodiments;
FIG. 87 illustrates a flow diagram of a series of actions for modifying shadows in a two-dimensional image based on three-dimensional characteristics of the two-dimensional image, in accordance with one or more embodiments;
FIG. 88 illustrates a flow diagram of a series of acts for modifying shadows in a two-dimensional image with multiple shadow maps of an object of the two-dimensional image in accordance with one or more embodiments;
FIG. 89 illustrates a flow diagram of a series of acts for generating a scale field indicative of a pixel-to-metric distance ratio of a two-dimensional image in accordance with one or more embodiments;
FIG. 90 illustrates a flow diagram of a series of acts for generating a three-dimensional human model of a two-dimensional human in a two-dimensional image in accordance with one or more embodiments;
FIG. 91 illustrates a flowchart of a series of acts for modifying a two-dimensional image based on modifying a pose of a three-dimensional human model representing a two-dimensional human in the two-dimensional image in accordance with one or more embodiments;
FIG. 92 illustrates a flowchart of a series of acts for generating a planar surface for transforming an object in a two-dimensional image based on a three-dimensional representation of the two-dimensional image in accordance with one or more embodiments;
FIG. 93 illustrates a flowchart of a series of acts for modifying a focus of a two-dimensional image based on a three-dimensional representation of the two-dimensional image in accordance with one or more embodiments; and
FIG. 94 illustrates a block diagram of an exemplary computing device in accordance with one or more embodiments.
Detailed Description
One or more embodiments described herein include a scene-based image editing system that uses intelligent image understanding to implement scene-based image editing techniques. Indeed, in one or more embodiments, the scene-based image editing system utilizes one or more machine learning models to process the digital image in anticipation of user interaction for modifying the digital image. For example, in some implementations, a scene-based image editing system performs operations that construct a knowledge set for a digital image and/or automatically initiate a workflow for certain modifications before receiving user input for those modifications. Based on this preprocessing, the scene-based image editing system facilitates user interaction with the digital image as if it were a real scene reflecting real world conditions. For example, a scene-based image editing system enables user interaction to target pre-processed semantic regions (e.g., objects that have been identified and/or masked via pre-processing) as distinct components for editing rather than individual underlying pixels. Further, the scene-based image editing system automatically modifies the digital image to consistently reflect the corresponding real world conditions.
As described above, in one or more embodiments, a scene-based image editing system utilizes machine learning to process digital images to anticipate future modifications. Specifically, in some cases, the scene-based image editing system uses one or more machine learning models to perform preparatory operations that will facilitate subsequent modifications. In some embodiments, the scene-based image editing system automatically performs preprocessing in response to receiving the digital image. For example, in some implementations, a scene-based image editing system collects data and/or initiates a workflow for editing digital images prior to receiving user input for such editing. Thus, scene-based image editing systems allow user interaction to directly indicate the desired edits to a digital image, rather than the various preparatory steps often used to make such edits.
As an example, in one or more embodiments, a scene-based image editing system pre-processes digital images to facilitate object-aware modification. In particular, in some embodiments, the scene-based image editing system pre-processes the digital image to anticipate user input for manipulating one or more semantic regions of the digital image, such as user input for moving or deleting one or more objects within the digital image.
To illustrate, in some cases, a scene-based image editing system utilizes a segmentation neural network to generate an object mask for each object depicted in a digital image. In some cases, the scene-based image editing system utilizes a hole-filling model to generate a content fill (e.g., an inpainted segment) for each object (e.g., for each corresponding object mask). In some implementations, the scene-based image editing system generates a complete background for the digital image by pre-filling the object holes with the corresponding content fills. Thus, in one or more embodiments, a scene-based image editing system pre-processes digital images to prepare for object-aware modifications, such as move operations or delete operations, by pre-generating object masks and/or content fills before user input for such modifications is received.
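As a concrete, non-limiting illustration of this preprocessing, the following Python sketch pre-computes object masks and content fills for a digital image. The callables `segment_fn` and `inpaint_fn` are hypothetical stand-ins for a segmentation neural network and a hole-filling model; they are assumptions for the example rather than components named in the disclosure.

```python
def preprocess_image(image, segment_fn, inpaint_fn):
    """Pre-compute object masks and content fills for a digital image.

    Assumptions for this sketch: `image` is an H x W x 3 numpy array,
    `segment_fn(image)` returns {object_id: H x W boolean mask}, and
    `inpaint_fn(image, mask)` returns an image whose pixels under `mask`
    contain plausible background content.
    """
    object_masks = segment_fn(image)

    content_fills = {}
    background = image.copy()
    for obj_id, mask in object_masks.items():
        fill = inpaint_fn(image, mask)
        content_fills[obj_id] = fill
        # Build a "completed" background so later edits can expose it immediately.
        background[mask] = fill[mask]

    return {"masks": object_masks, "fills": content_fills, "background": background}
```

The returned masks, fills, and completed background correspond to the assets the system holds ready before any user input for an edit arrives.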
Thus, upon receiving one or more user inputs for an object-aware modification (e.g., a move operation or a delete operation) of an object of a digital image, the scene-based image editing system completes the modification with the corresponding pre-generated object mask and/or content fill. For example, in some cases, a scene-based image editing system detects a user interaction (e.g., a user selection of an object) with an object depicted in a digital image via a graphical user interface that displays the digital image. In response to the user interaction, the scene-based image editing system presents the previously generated corresponding object mask. The scene-based image editing system also detects a second user interaction with the object (e.g., with the presented object mask) for moving or deleting the object via the graphical user interface. Accordingly, the scene-based image editing system moves or deletes the object, exposing the content fill previously generated for the region behind the object.
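To show how such a modification can be completed from the pre-generated assets alone, the following sketch composites the content fill into the object's original location for a delete or move; the array shapes and function names are illustrative assumptions, not details from the disclosure.

```python
import numpy as np

def delete_object(image, mask, fill):
    """Remove an object by revealing its pre-generated content fill."""
    edited = image.copy()
    edited[mask] = fill[mask]
    return edited

def move_object(image, mask, fill, dy, dx):
    """Move an object by (dy, dx) pixels, exposing the fill at its old location."""
    edited = delete_object(image, mask, fill)
    ys, xs = np.nonzero(mask)
    new_ys, new_xs = ys + dy, xs + dx
    inside = (new_ys >= 0) & (new_ys < image.shape[0]) & \
             (new_xs >= 0) & (new_xs < image.shape[1])
    # Paste the object's original pixels at the shifted location.
    edited[new_ys[inside], new_xs[inside]] = image[ys[inside], xs[inside]]
    return edited
```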
Further, in one or more embodiments, a scene-based image editing system pre-processes a digital image to generate a semantic scene graph for the digital image. In particular, in some embodiments, a scene-based image editing system generates a semantic scene graph to map various characteristics of a digital image. For example, in some cases, a scene-based image editing system generates a semantic scene graph that describes objects depicted in a digital image, relationships or object properties of those objects, and/or various other characteristics determined to be available for subsequent modification of the digital image.
In some cases, the scene-based image editing system utilizes one or more machine learning models to determine characteristics of digital images to be included in the semantic scene graph. Further, in some cases, the scene-based image editing system utilizes one or more predetermined or pre-generated template graphs to generate a semantic scene graph. For example, in some embodiments, the scene-based image editing system utilizes image analysis graphs, real-world category description graphs, and/or behavior policy graphs in generating semantic scene graphs.
Thus, in some cases, the scene-based image editing system facilitates modification of digital images using semantic scene graphs generated for the digital images. For example, in some embodiments, upon determining that an object has been selected for modification, the scene-based image editing system retrieves characteristics of the object from the semantic scene graph to facilitate the modification. To illustrate, in some implementations, a scene-based image editing system performs or suggests one or more additional modifications to a digital image based on characteristics from a semantic scene graph.
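The following sketch shows one plausible in-memory representation of a semantic scene graph and the kind of lookup performed when an edit targets an object; the node fields, object names, and relationship labels are illustrative assumptions rather than a specification of the actual graph format.

```python
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    """One object/semantic region recorded in the semantic scene graph."""
    object_id: str
    category: str                                        # e.g. "person", "sky"
    attributes: dict = field(default_factory=dict)       # e.g. {"color": "black"}
    relationships: list = field(default_factory=list)    # e.g. [("holding", "obj_2")]

# Illustrative graph for an image of a person holding a phone (names made up).
scene_graph = {
    "obj_1": SceneNode("obj_1", "person", {"pose": "standing"}, [("holding", "obj_2")]),
    "obj_2": SceneNode("obj_2", "phone", {"color": "black"}, [("held_by", "obj_1")]),
}

def related_objects(graph, object_id):
    """Retrieve objects related to a selected object to inform an edit."""
    return [target for _, target in graph[object_id].relationships]

print(related_objects(scene_graph, "obj_1"))   # ['obj_2']
```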
As one example, in some embodiments, upon determining that an object has been selected for modification, a scene-based image editing system provides one or more object properties of the object for display via a graphical user interface that displays the object. For example, in some cases, a scene-based image editing system retrieves a set of object properties (e.g., size, shape, or color) of an object from a corresponding semantic scene graph and presents the set of object properties for display in association with the object.
In some cases, the scene-based image editing system also facilitates user interaction with the displayed set of object properties to modify one or more object properties. For example, in some embodiments, the scene-based image editing system allows user interactions that change the text of a displayed object property or that select from a provided set of object property alternatives. Based on the user interaction, the scene-based image editing system modifies the digital image by modifying one or more object properties in accordance with the user interaction.
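Building on the scene-graph sketch above, the following hypothetical helper illustrates how a user's change to a displayed property might be applied while keeping the graph in sync; `attribute_edit_fn` stands in for an attribute-modification neural network and is not an actual API of any particular system.

```python
def modify_object_attribute(image, scene_graph, object_id, attribute, new_value,
                            attribute_edit_fn):
    """Apply a user's change to a displayed object property.

    `scene_graph` maps object ids to SceneNode records (see earlier sketch);
    `attribute_edit_fn` is an assumed callable that re-renders the object's
    pixels to match the requested attribute value.
    """
    node = scene_graph[object_id]
    old_value = node.attributes.get(attribute)   # e.g. "color": "black"
    node.attributes[attribute] = new_value       # keep the semantic scene graph in sync
    return attribute_edit_fn(image, object_id, attribute, old_value, new_value)
```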
As another example, in some implementations, a scene-based image editing system utilizes a semantic scene graph to implement relationship-aware object modification. To illustrate, in some cases, a scene-based image editing system detects a user interaction that selects an object depicted in a digital image for modification. The scene-based image editing system references a semantic scene graph previously generated for the digital image to identify a relationship between the object and one or more other objects depicted in the digital image. Based on the identified relationships, the scene-based image editing system also targets one or more related objects for modification.
For example, in some cases, the scene-based image editing system automatically adds one or more related objects to the user selection. In some cases, the scene-based image editing system provides a suggestion to include one or more related objects in the user selection and adds one or more related objects based on acceptance of the suggestion. Thus, in some embodiments, the scene-based image editing system modifies one or more related objects as it modifies the user-selected object.
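A minimal sketch of relationship-aware selection expansion, reusing the scene-graph structure from the earlier example; the relationship labels and the choice of which relationships propagate a selection are assumptions for illustration (a real system would draw them from a behavior policy).

```python
def expand_selection(scene_graph, selected_ids, propagating=("holding", "wearing")):
    """Add related objects to a user selection based on scene-graph relationships."""
    expanded = set(selected_ids)
    for obj_id in selected_ids:
        for relation, target in scene_graph[obj_id].relationships:
            if relation in propagating:          # relationship types that move together
                expanded.add(target)
    return expanded

# With the example graph above, selecting the person also selects the phone:
# expand_selection(scene_graph, {"obj_1"}) -> {"obj_1", "obj_2"}
```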
In one or more embodiments, in addition to preprocessing the digital image to identify depicted objects and their relationships and/or object properties, the scene-based image editing system also preprocesses the digital image to help remove interfering objects. For example, in some cases, a scene-based image editing system utilizes an interferent detection neural network to classify one or more objects depicted in a digital image as subjects of the digital image and/or to classify one or more other objects depicted in the digital image as interfering objects. In some embodiments, the scene-based image editing system provides visual indications of interfering objects within the display of the digital image suggesting removal of these objects to present a more aesthetic and consistent visual result.
Furthermore, in some cases, the scene-based image editing system detects shadows of interfering objects (or other selected objects) for removal with the interfering objects. In particular, in some cases, a scene-based image editing system utilizes a shadow detection neural network to identify shadows depicted in a digital image and associate the shadows with their corresponding objects. Thus, upon removal of the interfering object from the digital image, the scene-based image editing system also automatically removes the associated shadow.
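The following sketch illustrates how removal of interfering objects could reuse the pre-completed background from the preprocessing example, deleting each object classified as an interferent together with its associated shadow mask; the input dictionaries are assumed shapes for the example, not outputs of any specific network.

```python
def remove_distractors(image, background, masks, classifications, shadow_masks):
    """Delete objects classified as interfering objects together with their shadows.

    Assumptions: `classifications` maps object id -> "subject" or "distractor",
    `masks` maps object id -> H x W boolean mask, `shadow_masks` maps object id
    -> boolean mask of the detected shadow, and `background` is the completed
    background produced during preprocessing.
    """
    edited = image.copy()
    for obj_id, label in classifications.items():
        if label != "distractor":
            continue
        region = masks[obj_id].copy()
        if obj_id in shadow_masks:                 # remove the associated shadow too
            region |= shadow_masks[obj_id]
        edited[region] = background[region]        # reveal pre-generated content
    return edited
```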
In one or more embodiments, the scene-based image editing system further provides editing of the two-dimensional image based on three-dimensional characteristics of a scene in the two-dimensional image. For example, a scene-based image editing system utilizes depth estimation of a scene of a two-dimensional image to generate a three-dimensional grid representing foreground/background objects in the scene. Furthermore, scene-based image editing systems utilize three-dimensional characteristics to provide realistic editing of objects within a two-dimensional image according to three-dimensional relative positions. To illustrate, a scene-based image editing system provides shadow generation or focus determination based on three-dimensional characteristics of a scene in a two-dimensional image. In additional embodiments, the scene-based image editing system provides three-dimensional human modeling with interactive re-pose.
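As a rough illustration of lifting a two-dimensional image into a three-dimensional representation, the sketch below unprojects a monocular depth map into a vertex grid and triangulates it. The placeholder intrinsics and the one-vertex-per-pixel tessellation are simplifying assumptions; the figures above describe adaptive, density-based sampling of the image rather than a dense per-pixel grid.

```python
import numpy as np

def depth_to_mesh(depth, fx=1.0, fy=1.0):
    """Lift a per-pixel depth estimate (H x W array) into a simple 3D mesh.

    fx and fy are placeholder focal lengths chosen for illustration only.
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Unproject pixel coordinates into camera space using the depth values.
    x3 = (xs - w / 2) * depth / fx
    y3 = (ys - h / 2) * depth / fy
    vertices = np.stack([x3, y3, depth], axis=-1).reshape(-1, 3)

    # Connect neighboring pixels into two triangles per grid cell.
    idx = np.arange(h * w).reshape(h, w)
    tri_a = np.stack([idx[:-1, :-1], idx[1:, :-1], idx[:-1, 1:]], axis=-1)
    tri_b = np.stack([idx[:-1, 1:], idx[1:, :-1], idx[1:, 1:]], axis=-1)
    faces = np.concatenate([tri_a, tri_b], axis=0).reshape(-1, 3)
    return vertices, faces
```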
Scene-based image editing systems offer advantages over conventional systems. In fact, conventional image editing systems suffer from several technical drawbacks, resulting in inflexible and inefficient operation. To illustrate, conventional systems are often inflexible in that they strictly perform editing on digital images at the pixel level. In particular, conventional systems typically perform a particular edit by targeting pixels individually for the edit. Thus, such systems often strictly require user interaction for editing a digital image to interact with a single pixel to indicate the region for editing. Furthermore, many conventional systems (e.g., due to their pixel-based editing) require the user to have a great deal of in-depth expertise in how to interact with the digital image and the user interface of the system itself, to select the desired pixels and to perform the appropriate workflow to edit those pixels.
Furthermore, conventional image editing systems often do not function efficiently. For example, conventional systems typically require a significant amount of user interaction to modify the digital image. In fact, conventional systems typically require a user to interact with multiple menus, sub-menus, and/or windows to perform editing, in addition to user interaction for selecting a single pixel. For example, many edits may require multiple editing steps to be performed using multiple different tools. Thus, many conventional systems require multiple interactions to select the appropriate tool at a given editing step, set the required parameters for the tool, and utilize the tool to perform the editing step.
The scene-based image editing system has improved flexibility compared to conventional systems. In particular, the scene-based image editing system implements techniques that facilitate flexible scene-based editing. For example, by preprocessing a digital image via machine learning, the scene-based image editing system allows a digital image to be edited as if it were a real scene, in which the various elements of the scene are known and can be interacted with on a semantic level to perform edits that consistently reflect real world conditions. Indeed, where a pixel is the target unit under many conventional systems and an object is typically considered merely a group of pixels, the scene-based image editing system allows a user to treat an entire semantic region (e.g., an object) as a distinct unit. Furthermore, where conventional systems typically require in-depth, specialized knowledge of the tools and workflows required to perform editing, the scene-based image editing system provides a more intuitive editing experience that enables users to focus on the end goal of the edit.
Furthermore, scene-based image editing systems have higher efficiency than conventional systems. In particular, the scene-based image editing system implements a graphical user interface that reduces the user interaction required for editing. In effect, by preprocessing the digital image for desired editing, the scene-based image editing system reduces the user interaction required to perform editing. In particular, a scene-based image editing system performs many of the operations required for editing, without relying on user instructions to perform those operations. Thus, in many cases, scene-based image editing systems reduce the user interaction typically required in conventional systems to select pixels to be the editing target and navigate menus, submenus, or other windows to select a tool, select its corresponding parameters, and apply the tool to perform editing. By implementing a graphical user interface that reduces and simplifies the user interaction required to edit digital images, a scene-based image editing system provides an improved user experience on computing devices (e.g., tablet or smartphone devices) that have relatively limited screen space.
Additional details regarding the scene-based image editing system will now be provided with reference to the accompanying drawings. For example, FIG. 1 shows a schematic diagram of an exemplary system 100 in which a scene-based image editing system 106 operates. As shown in FIG. 1, system 100 includes server(s) 102, network 108, and client devices 110a-110n.
Although the system 100 of fig. 1 is depicted as having a particular number of components, the system 100 can have any number of additional or alternative components (e.g., any number of servers, client devices, or other components in communication with the scene-based image editing system 106 via the network 108). Similarly, while FIG. 1 shows a particular arrangement of server(s) 102, network 108, and client devices 110a-110n, various additional arrangements are possible.
Server(s) 102, network 108, and client devices 110a-110n are communicatively coupled to each other either directly or indirectly (e.g., through network 108 discussed in more detail below in connection with fig. 94). Further, server(s) 102 and client devices 110a-110n include one or more of a variety of computing devices (including one or more of the computing devices discussed in more detail with reference to FIG. 94).
As described above, the system 100 includes the server(s) 102. In one or more embodiments, server(s) 102 generate, store, receive, and/or transmit data including digital images and modified digital images. In one or more embodiments, server(s) 102 include a data server. In some implementations, the server(s) 102 include a communication server or a web hosting server.
In one or more embodiments, the image editing system 104 provides functionality for a client device (e.g., a user of one of the client devices 110a-110 n) to generate, edit, manage, and/or store digital images. For example, in some cases, the client device transmits the digital image to the image editing system 104 hosted on the server(s) 102 via the network 108. The image editing system 104 then provides options that the client device can use to edit the digital image, store the digital image, and then search for, access, and view the digital image. For example, in some cases, the image editing system 104 provides one or more options that a client device may use to modify objects within a digital image.
In one or more embodiments, client devices 110a-110n include computing devices that access, view, modify, store, and/or provide digital images for display. For example, client devices 110a-110n include smart phones, tablets, desktop computers, laptop computers, head mounted display devices, or other electronic devices. Client devices 110a-110n include one or more applications (e.g., client application 112) that may access, view, modify, store, and/or provide digital images for display. For example, in one or more embodiments, client application 112 includes a software application installed on client devices 110a-110 n. Additionally or alternatively, client application 112 includes a web browser or other application that accesses software applications hosted on server(s) 102 (and supported by image editing system 104).
To provide an example implementation, in some embodiments, the scene-based image editing system 106 on the server(s) 102 supports the scene-based image editing system 106 on the client device 110n. For example, in some cases, the scene-based image editing system 106 on the server(s) 102 learns parameters of the neural network(s) 114 for analyzing and/or modifying the digital image. The scene-based image editing system 106 then provides the neural network(s) 114 to the client device 110n via the server(s) 102. In other words, the client device 110n obtains (e.g., downloads) the neural network(s) 114 with the learned parameters from the server(s) 102. Once downloaded, the scene-based image editing system 106 on the client device 110n utilizes the neural network(s) 114 to analyze and/or modify the digital image independent of the server(s) 102.
In an alternative implementation, the scene-based image editing system 106 includes a web hosting application that allows the client device 110n to interact with content and services hosted on the server(s) 102. To illustrate, in one or more implementations, the client device 110n accesses a software application supported by the server(s) 102. In response, the scene-based image editing system 106 on the server(s) 102 modifies the digital image. Server(s) 102 then provide the modified digital image to client device 110n for display.
Indeed, the scene-based image editing system 106 can be implemented in whole or in part by the various elements of the system 100. Indeed, while FIG. 1 shows scene-based image editing system 106 implemented with respect to server(s) 102, the different components of scene-based image editing system 106 can be implemented by various devices within system 100. For example, one or more (or all) of the components of the scene-based image editing system 106 are implemented by a different computing device (e.g., one of the client devices 110a-110 n) or a separate server than the server(s) 102 hosting the image editing system 104. In fact, as shown in FIG. 1, client devices 110a-110n include scene-based image editing system 106. Example components of scene-based image editing system 106 are described below with reference to fig. 44.
As described above, in one or more embodiments, the scene-based image editing system 106 manages two-dimensional digital images as real scenes reflecting real world conditions. In particular, the scene-based image editing system 106 implements a graphical user interface that facilitates modification of digital images as real scenes. FIG. 2 illustrates an overview of the scene-based image editing system 106 managing digital images as real scenes in accordance with one or more embodiments.
As shown in fig. 2, scene-based image editing system 106 provides a graphical user interface 202 for display on a client device 204. As further shown, the scene-based image editing system 106 provides a digital image 206 for display within the graphical user interface 202. In one or more embodiments, after capturing the digital image 206 via the camera of the client device 204, the scene-based image editing system 106 provides the digital image 206 for display. In some cases, the scene-based image editing system 106 receives the digital image 206 from another computing device or otherwise accesses the digital image 206 at some storage location, either local or remote.
As shown in fig. 2, the digital image 206 depicts various objects. In one or more embodiments, the object includes different visual components depicted in the digital image. In particular, in some embodiments, the object includes a different visual element that may be identified separately from other visual elements depicted in the digital image. In many cases, an object includes a set of pixels that together depict different visual elements that are separate from the depiction of other pixels. An object refers to a visual representation of a subject, concept, or sub-concept in an image. In particular, an object refers to a group of pixels in an image that combine together to form a visual depiction of an item, article, portion of an item, component, or element. In some cases, objects may be identified via different levels of abstraction. In other words, in some cases, an object includes individual object components that may be individually identified or as part of an aggregation. To illustrate, in some embodiments, the object includes a semantic region (e.g., sky, ground, water, etc.). In some embodiments, the object includes an instance of an identifiable thing (e.g., a person, an animal, a building, an automobile, or a cloud, clothing, or some other accessory). In one or more embodiments, an object includes a child object, component, or portion. For example, a human face, hair, or leg may be an object that is part of another object (e.g., a human body). In yet another implementation, the shadow or reflection includes a portion of the object. As another example, a shirt is an object that may be part of another object (e.g., a person).
As shown in FIG. 2, the digital image 206 depicts a static two-dimensional image. In particular, the digital image 206 depicts a two-dimensional projection of a scene captured from the perspective of the camera. Thus, the digital image 206 reflects the conditions under which the image was captured (e.g., lighting, the ambient environment, or physical conditions to which the depicted objects are subjected); however, it does so statically. In other words, the conditions are not inherently maintained when changes are made to the digital image 206. Under many conventional systems, additional user interaction is required to maintain consistency with these conditions when editing digital images.
Further, the digital image 206 includes a plurality of individual pixels that collectively depict various semantic regions. For example, the digital image 206 depicts a plurality of objects, such as objects 208a-208c. Although the pixels of each object contribute to the depiction of a cohesive visual unit, they are not generally treated as such. In practice, a pixel of a digital image is typically treated as an individual unit with its own value (e.g., color value), which can be modified separately from the values of other pixels. Thus, conventional systems typically require user interaction to individually target pixels for modification when making changes to the digital image.
However, as shown in FIG. 2, the scene-based image editing system 106 manages the digital image 206 as a real scene, consistently maintaining the conditions under which the image was captured as the digital image is modified. In particular, the scene-based image editing system 106 automatically maintains those conditions without relying on user input to reflect them. In addition, the scene-based image editing system 106 manages the digital image 206 at a semantic level. In other words, the scene-based image editing system 106 manages each semantic region depicted in the digital image 206 as a coherent unit. For example, as shown in FIG. 2 and discussed below, user interaction is not required to select the underlying pixels in order to interact with the corresponding object; rather, the scene-based image editing system 106 enables user input to target the object as a unit, and the scene-based image editing system 106 automatically identifies the pixels associated with the object.
To illustrate, as shown in fig. 2, in some cases, the scene-based image editing system 106 operates on a computing device 200 (e.g., a client device 204 or a separate computing device, such as the server(s) 102 discussed above with reference to fig. 1) to pre-process the digital image 206. In particular, the scene-based image editing system 106 performs one or more preprocessing operations to anticipate future modifications to the digital image. In one or more embodiments, the scene-based image editing system 106 automatically performs these preprocessing operations in response to receiving or accessing the digital image 206 before user input for making the desired modification has been received. As further shown, the scene-based image editing system 106 utilizes one or more machine learning models, such as neural network(s) 114, to perform preprocessing operations.
In one or more embodiments, the scene-based image editing system 106 pre-processes the digital image 206 by learning characteristics of the digital image 206. For example, in some cases, scene-based image editing system 106 segments digital image 206, identifies objects, classifies objects, determines relationships and/or properties of objects, determines illumination characteristics, and/or determines depth/perspective characteristics. In some embodiments, scene-based image editing system 106 pre-processes digital image 206 by generating content for modifying digital image 206. For example, in some implementations, the scene-based image editing system 106 generates object masks for each depicted object and/or generates content fills for filling in the background behind each depicted object. The background refers to the content behind the object in the image. Thus, when the first object is positioned before the second object, the second object forms at least part of the background of the first object. Alternatively, the background contains the furthest elements in the image (typically semantic regions such as sky, ground, water, etc.). In one or more embodiments, the context of an object includes a plurality of object/semantic regions. For example, the background of an object may include a portion of another object and a portion of the furthest element in the image. Various preprocessing operations and their use in modifying digital images will be discussed in more detail below with reference to subsequent figures.
As shown in FIG. 2, the scene-based image editing system 106 detects user interactions with object 208c via the graphical user interface 202. In particular, the scene-based image editing system 106 detects a user interaction for selecting object 208c. Indeed, in one or more embodiments, the scene-based image editing system 106 determines that the user interaction targets the object, even where the user interaction involves only a subset of the pixels contributing to object 208c, based on the pre-processing of the digital image 206. For example, as described above, in some embodiments, the scene-based image editing system 106 pre-processes the digital image 206 via segmentation. Thus, upon detecting the user interaction, the scene-based image editing system 106 has already partitioned/segmented the digital image 206 into its various semantic regions. Thus, in some cases, the scene-based image editing system 106 determines that the user interaction selects a distinct semantic region (e.g., object 208c) rather than the particular underlying pixels or image layer with which the user interacted.
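A simple sketch of resolving a user interaction to a pre-segmented semantic region rather than to raw pixels follows; the mask dictionary is assumed to come from the preprocessing step, and the linear lookup shown is only one possible approach.

```python
def object_at(masks, x, y):
    """Resolve a click at pixel (x, y) to the pre-segmented object it falls on.

    `masks` maps object id -> H x W boolean mask produced during preprocessing,
    so the lookup is a simple mask test rather than a pixel-selection workflow.
    """
    for obj_id, mask in masks.items():
        if mask[y, x]:
            return obj_id
    return None     # the click landed on an unsegmented region
```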
As further shown in fig. 2, scene-based image editing system 106 modifies digital image 206 via modification to object 208c. Although FIG. 2 illustrates the deletion of object 208c, various modifications are possible and will be discussed in more detail below. In some embodiments, scene-based image editing system 106 edits object 208c in response to detecting a second user interaction to perform the modification.
As shown, upon deleting object 208c from the digital image 206, the scene-based image editing system 106 automatically displays the background pixels that take the place of object 208c. Indeed, as described above, in some embodiments, the scene-based image editing system 106 pre-processes the digital image 206 by generating a content fill for each depicted foreground object. Thus, as shown in FIG. 2, when object 208c is removed from the digital image 206, the scene-based image editing system 106 automatically exposes the content fill 210 previously generated for object 208c. In some cases, the scene-based image editing system 106 positions the content fill 210 within the digital image such that the content fill 210 is exposed, rather than a hole appearing, when object 208c is removed.
Thus, the scene-based image editing system 106 has improved flexibility compared to many conventional systems. In particular, the scene-based image editing system 106 implements flexible scene-based editing techniques in which digital images are modified to maintain a real scene of real world conditions (e.g., physical, environmental, or object relationships). Indeed, in the example shown in fig. 2, scene-based image editing system 106 utilizes pre-generated content fills to consistently maintain the context depicted in digital image 206 as if digital image 206 had captured the entirety of the context. Thus, the scene-based image editing system 106 enables the depicted objects to freely move around (or completely remove) without interrupting the scene depicted therein.
Furthermore, the scene-based image editing system 106 operates with improved efficiency. In effect, by segmenting digital image 206 and generating content pad 210 in anticipation of modifications to remove object 208c from its location in digital image 206, scene-based image editing system 106 reduces user interactions that would normally be required to perform those same operations under conventional systems. Thus, the scene-based image editing system 106 is able to implement the same modification to the digital image with less user interaction than these conventional systems.
As described above, in one or more embodiments, scene-based image editing system 106 implements object-aware image editing on digital images. In particular, scene-based image editing system 106 implements object-aware modifications that treat objects as coherent units that are interactable and modifiable. FIGS. 3-9B illustrate the scene-based image editing system 106 implementing object-aware modifications in accordance with one or more embodiments.
In fact, many conventional image editing systems are inflexible and inefficient in interacting with objects depicted in digital images. For example, as previously mentioned, conventional systems are typically inflexible in that they require user interaction to target separate pixels, rather than objects depicted by those pixels. Thus, such systems typically require a strict and careful procedure to select the pixels to be modified. Furthermore, when objects are identified via user selection, these systems often fail to predict and prepare for potential editing of these objects.
In addition, many conventional image editing systems require a significant amount of user interaction to modify objects depicted in the digital image. Indeed, in addition to the pixel selection process for identifying objects in a digital image, which may itself require a series of user interactions, conventional systems may require a considerable workflow in which a user interacts with multiple menus, sub-menus, tools and/or windows to perform editing. Typically, performing editing on an object requires multiple preparatory steps before the required editing can be performed, which requires additional user interaction.
Scene-based image editing system 106 provides advantages over these systems. For example, the scene-based image editing system 106 provides improved flexibility in image editing via object awareness. In particular, scene-based image editing system 106 implements object-level interactions, rather than pixel-level or layer-level interactions, to facilitate user interactions that directly target objects depicted as coherent units rather than their individual constituent pixels.
Furthermore, the scene-based image editing system 106 improves the efficiency of interaction with objects depicted in the digital image. In fact, as previously described, and as will be discussed further below, the scene-based image editing system 106 implements preprocessing operations for identifying and/or segmenting depicted objects in anticipation of modifications to those objects. Indeed, in many cases, the scene-based image editing system 106 performs these preprocessing operations without receiving user interactions for those modifications. Thus, the scene-based image editing system 106 reduces the user interaction required to perform a given edit on the depicted object.
In some embodiments, the scene-based image editing system 106 implements object-aware image editing by generating object masks (object masks) for each object/semantic region depicted in the digital image. Specifically, in some cases, the scene-based image editing system 106 utilizes a machine learning model, such as a segmented neural network, to generate the object mask(s). FIG. 3 illustrates a segmented neural network used by a scene-based image editing system 106 to generate object masks for objects in accordance with one or more embodiments.
In one or more embodiments, the object mask includes a map of the digital image that indicates, for each pixel, whether the pixel corresponds to a portion of an object (or other semantic region). In some implementations, the indication includes a binary indication (e.g., a "1" for pixels belonging to the object and a "0" for pixels not belonging to the object). In an alternative implementation, the indication includes a probability (e.g., a number between 0 and 1) that indicates the likelihood that the pixel belongs to the object. In such an implementation, the closer the value is to 1, the more likely the pixel belongs to the object, and vice versa.
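For illustration only, the following minimal Python sketch shows one way such an object mask could be represented and thresholded; the array values and the 0.5 threshold are assumptions for the example and are not part of the disclosed embodiments.

    import numpy as np

    # Hypothetical soft object mask: each value is the probability that the pixel
    # belongs to the object (values closer to 1 are more likely object pixels).
    soft_mask = np.array([[0.02, 0.10, 0.05],
                          [0.15, 0.93, 0.88],
                          [0.07, 0.97, 0.91]])

    # Binary variant of the same mask: 1 for object pixels, 0 for background pixels.
    binary_mask = (soft_mask >= 0.5).astype(np.uint8)
    print(binary_mask)
    # [[0 0 0]
    #  [0 1 1]
    #  [0 1 1]]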
In one or more embodiments, the machine learning model includes a computer representation that is tunable (e.g., trained) based on inputs to approximate an unknown function used to generate the corresponding outputs. Specifically, in some embodiments, the machine learning model includes a computer-implemented model that utilizes algorithms to learn from, and make predictions on, known data by analyzing the known data to learn to generate outputs that reflect patterns and attributes of the known data. For example, in some cases, the machine learning model includes, but is not limited to, a neural network (e.g., a convolutional neural network, recurrent neural network, or other deep learning network), a decision tree (e.g., a gradient-boosted decision tree), association rule learning, inductive logic programming, support vector learning, a Bayesian network, a regression-based model (e.g., erasure regression), principal component analysis, or a combination thereof.
In one or more embodiments, the neural network includes a model of interconnected artificial neurons (e.g., organized in layers) that communicate and learn to approximate complex functions and generate outputs based on a plurality of inputs provided to the model. In some cases, the neural network includes one or more machine learning algorithms. Furthermore, in some cases, the neural network includes an algorithm (or set of algorithms) that implements deep learning techniques that utilize a set of algorithms to model high-level abstractions in data. To illustrate, in some embodiments, the neural network includes a convolutional neural network, a recurrent neural network (e.g., a long short-term memory neural network), a generative adversarial neural network, a graph neural network, or a multi-layer perceptron. In some embodiments, the neural network includes a combination of neural networks or neural network components.
In one or more embodiments, the segmented neural network includes a computer-implemented neural network that generates object masks for objects depicted in the digital image. In particular, in some embodiments, the segmented neural network includes a computer-implemented neural network that detects objects within the digital image and generates object masks for the objects. Indeed, in some implementations, the segmented neural network includes a neural network pipeline that analyzes the digital image, identifies one or more objects depicted in the digital image, and generates object masks for the one or more objects. However, in some cases, the segmented neural network focuses on a subset of the tasks used to generate the object mask.
As described above, FIG. 3 illustrates one example of a segmented neural network that the scene-based image editing system 106 uses in one or more implementations to generate object masks for objects depicted in a digital image. In particular, FIG. 3 illustrates one example of a segmented neural network that the scene-based image editing system 106 uses in some embodiments to detect objects in a digital image and generate object masks for the objects. In effect, FIG. 3 shows a detection masking neural network 300 that includes an object detection machine learning model 308 (in the form of an object detection neural network) and an object segmentation machine learning model 310 (in the form of an object segmentation neural network). Specifically, the detection masking neural network 300 is an implementation of the on-device masking system described in U.S. patent application Ser. No. 17/589,114, filed on January 31, 2022, the entire contents of which are incorporated herein by reference.
Although FIG. 3 illustrates the scene-based image editing system 106 utilizing the detection masking neural network 300, in one or more implementations, the scene-based image editing system 106 utilizes a different machine learning model to detect objects, generate object masks for objects, and/or extract objects from digital images. For example, in one or more implementations, the scene-based image editing system 106 utilizes one of the machine learning models or neural networks described in the following documents as (or as an alternative to) the segmented neural network: U.S. patent application Ser. No. 17/158,527, entitled "Segmenting Objects In Digital Images Utilizing A Multi-Object Segmentation Model Framework," filed on January 26, 2021; U.S. patent application Ser. No. 16/388,115, entitled "Robust Training of Large-Scale Object Detectors with Noisy Data," filed on April 8, 2019; U.S. patent application Ser. No. 16/518,880, entitled "Utilizing Multiple Object Segmentation Models To Automatically Select User-Requested Objects In Images," filed on July 22, 2019; U.S. patent application Ser. No. 16/817,418, entitled "Utilizing A Large-Scale Object Detector To Automatically Select Objects In Digital Images," filed on March 20, 2020; Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," NIPS, 2015; or Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," CVPR, 2016, the contents of each of the foregoing applications and papers being incorporated herein by reference in their entirety.
Similarly, in one or more implementations, the scene-based image editing system 106 utilizes one of the machine learning models or neural networks described in the following documents as (or as an alternative to) the segmented neural network: Ning Xu et al., "Deep GrabCut for Object Selection," published July 14, 2017; U.S. patent application publication No. 2019/013129, entitled "Deep Salient Content Neural Networks for Efficient Digital Object Segmentation," filed on October 31, 2017; U.S. patent application Ser. No. 16/035,410, entitled "Automatic Trimap Generation and Image Segmentation," filed on July 13, 2018; or U.S. patent No. 10,192,129, entitled "Utilizing Interactive Deep Learning To Select Objects In Digital Visual Media," filed on November 18, 2015, each of which is incorporated herein by reference in its entirety.
In one or more implementations, the segmented neural network is a panoramic segmented neural network. In other words, the split neural network creates object masks for individual instances of a given object type. Further, in one or more implementations, the segmented neural network generates object masks for semantic regions (e.g., water, sky, sand, dirt, etc.), except for countless things. Indeed, in one or more implementations, the scene-based image editing system 106 utilizes one of the machine learning models or neural networks described in the following documents as (or as an alternative to) the segmented neural network: U.S. patent application Ser. No. 17/495,618 entitled "PANOPTIC SEGMENTATION REFINEMENT NETWORK" (panoramic segmentation refinement network) filed on day 10 and 2 of 2021; or U.S. patent application Ser. No. 17/454,740 entitled "MULTI-SOURCE PANOPTIC FEATURE PYRAMID NETWORK" (Multi-source panoramic feature pyramid network), filed 11/12/2021, each of which is incorporated herein by reference in its entirety.
Returning now to FIG. 3, in one or more implementations, the scene-based image editing system 106 utilizes a detection masking neural network 300 that includes an encoder 302 (or neural network encoder) having a backbone network, a detection head (or neural network decoder head) 304, and a masking head (or neural network decoder head) 306. As shown in FIG. 3, the encoder 302 encodes the digital image 316 and provides the encoding to the detection head 304 and the masking head 306. The detection head 304 utilizes the encoding to detect one or more objects depicted in the digital image 316. The masking head 306 generates at least one object mask for the detected objects.
As described above, the detection masking neural network 300 utilizes both the object detection machine learning model 308 and the object segmentation machine learning model 310. In one or more implementations, the object detection machine learning model 308 includes both the encoder 302 and the detection head 304 shown in FIG. 3, while the object segmentation machine learning model 310 includes both the encoder 302 and the masking head 306. In some implementations, the object detection machine learning model 308 and the object segmentation machine learning model 310 are separate machine learning models for processing objects within target and/or source digital images. FIG. 3 shows the encoder 302, the detection head 304, and the masking head 306 as a single model for detecting and segmenting objects of a digital image. For efficiency purposes, in some embodiments, scene-based image editing system 106 uses the network shown in FIG. 3 as a single network. This collective network (i.e., the object detection machine learning model 308 and the object segmentation machine learning model 310) is referred to as the detection masking neural network 300. The following paragraphs describe components of the network related to the object detection machine learning model 308 (such as the detection head 304) and then transition to discussing components related to the object segmentation machine learning model 310.
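As a rough illustration of this arrangement, the following PyTorch-style sketch pairs a shared encoder with a detection head and a masking head; the layer sizes, channel counts, and structure are assumptions for the example and do not reproduce the detection masking neural network 300.

    import torch
    import torch.nn as nn

    class DetectionMaskingSketch(nn.Module):
        # A shared encoder (backbone) feeds two decoder heads: one predicts boxes
        # and class scores, the other predicts per-pixel object probabilities.
        def __init__(self, num_classes=80, feat_dim=256):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            )
            # detection head: 4 box offsets + class scores per feature-map location
            self.detection_head = nn.Conv2d(feat_dim, 4 + num_classes, 1)
            # masking head: per-pixel probability of belonging to an object
            self.masking_head = nn.Sequential(
                nn.Conv2d(feat_dim, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 1, 1), nn.Sigmoid(),
            )

        def forward(self, image):
            features = self.encoder(image)   # shared encoding of the digital image
            return self.detection_head(features), self.masking_head(features)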
As described above, in one or more embodiments, the scene-based image editing system 106 utilizes the object detection machine learning model 308 to detect and identify objects within the digital image 316 (e.g., a target or source digital image). FIG. 3 illustrates one implementation of an object detection machine learning model 308 that the scene-based image editing system 106 utilizes in accordance with at least one embodiment. In particular, FIG. 3 illustrates the scene-based image editing system 106 utilizing the object detection machine learning model 308 to detect objects. In one or more embodiments, the object detection machine learning model 308 includes a deep learning convolutional neural network (CNN). For example, in some embodiments, the object detection machine learning model 308 includes a region-based convolutional neural network (R-CNN).
As shown in fig. 3, the object detection machine learning model 308 includes a lower neural network layer and an upper neural network layer. Typically, lower neural network layers collectively form an encoder 302, while higher neural network layers collectively form a detection head 304 (e.g., a decoder). In one or more embodiments, encoder 302 includes a convolutional layer that encodes the digital image into feature vectors that are output from encoder 302 and provided as inputs to detection head 304. In various implementations, the detection head 304 includes a fully connected layer that analyzes feature vectors and outputs detected objects (possibly with approximate boundaries around the objects).
In particular, in one or more implementations, the encoder 302 includes convolutional layers that generate feature vectors in the form of a feature map. To detect objects within the digital image 316, the object detection machine learning model 308 processes the feature map with a convolutional layer in the form of a small network that slides across small windows of the feature map. The object detection machine learning model 308 then maps each sliding window to a lower-dimensional feature. In one or more embodiments, the object detection machine learning model 308 processes this feature using two separate detection heads that are fully connected layers. In some embodiments, the first head includes a box-regression layer that generates bounding boxes for the detected objects, and the second head includes an object classification layer that generates object tags.
As shown in FIG. 3, the output from the detection head 304 displays an object tag above each detected object. For example, the detection masking neural network 300 assigns an object tag to each detected object in response to detecting the object. Specifically, in some embodiments, the detection masking neural network 300 utilizes object tags based on object classification. For illustration, FIG. 3 shows a tag 318 for the woman, a tag 320 for the bird, and a tag 322 for the man. Although not shown in FIG. 3, in some implementations, the detection masking neural network 300 further distinguishes between the woman and the surfboard held by the woman. Furthermore, the detection masking neural network 300 also optionally generates object masks for the illustrated semantic regions (e.g., sand, sea, and sky).
As described above, the object detection machine learning model 308 detects objects within the digital image 316. In some embodiments, and as shown in fig. 3, detection masking neural network 300 indicates the detected object with approximate boundaries (e.g., bounding boxes 319, 321, and 323). For example, each bounding box includes an area surrounding the object. In some embodiments, the detection masking neural network 300 annotates the bounding box with the previously mentioned object tags, such as the name of the detected object, coordinates of the bounding box, and/or the size of the bounding box.
As shown in fig. 3, the object detection machine learning model 308 detects several objects of the digital image 316. In some cases, detecting masking neural network 300 identifies all objects within the bounding box. In one or more embodiments, the bounding box includes an approximate bounding region that indicates the detected object. In some cases, an approximate boundary refers to an indication of a region that includes objects that are larger and/or less precise than the object mask. In one or more embodiments, the approximate boundary includes at least a portion of the detected object and a portion of the digital image 316 that does not include the detected object. The approximate boundaries include various shapes, such as square, rectangular, circular, oval, or other contours around the object. In one or more embodiments, the approximate boundary includes a bounding box.
Upon detecting an object in the digital image 316, the detection masking neural network 300 generates an object mask for the detected object. In general, rather than using a rough bounding box during object localization, the detection masking neural network 300 generates a segmentation mask that better defines the boundary of the object. The following paragraphs provide additional details regarding generating object masks for detected objects in accordance with one or more embodiments. In particular, FIG. 3 illustrates that the scene-based image editing system 106 utilizes an object segmentation machine learning model 310 to generate segmented objects via object masking, according to some embodiments.
As shown in fig. 3, the scene-based image editing system 106 processes the objects detected in the bounding box using the object segmentation machine learning model 310 to generate object masks, such as object mask 324 and object mask 326. In an alternative embodiment, the scene-based image editing system 106 utilizes the object detection machine learning model 308 itself to generate object masks for the detected objects (e.g., to segment the objects for selection).
In one or more implementations, prior to generating the object mask of the detected object, the scene-based image editing system 106 receives user input 312 to determine the object for which the object mask is to be generated. For example, the scene-based image editing system 106 receives input from a user indicating a selection of one of the detected objects. To illustrate, in the illustrated implementation, the scene-based image editing system 106 receives user input 312 that a user selects bounding box 321 and bounding box 323. In an alternative implementation, the scene-based image editing system 106 automatically generates object masks for each object (e.g., without a user request indicating the object to be selected).
As described above, the scene-based image editing system 106 processes bounding boxes of the objects detected in the digital image 316 using the object segmentation machine learning model 310. In some embodiments, the bounding boxes comprise the output from the object detection machine learning model 308. For example, as shown in FIG. 3, each bounding box includes a rectangular border around the object. Specifically, FIG. 3 shows bounding boxes 319, 321, and 323 around the woman, the bird, and the man detected in the digital image 316.
In some embodiments, the scene-based image editing system 106 utilizes the object segmentation machine learning model 310 to generate object masks for the aforementioned detected objects within the bounding boxes. For example, the object segmentation machine learning model 310 corresponds to one or more deep neural networks or models that select an object based on bounding box parameters corresponding to the object within the digital image 316. Specifically, the object segmentation machine learning model 310 generates the object masks 324 and 326 for the detected man and bird, respectively.
In some embodiments, the scene-based image editing system 106 selects the object segmentation machine learning model 310 based on object tags of objects identified by the object detection machine learning model 308. Generally, based on identifying one or more object categories associated with the input bounding box, the scene-based image editing system 106 selects an object segmentation machine learning model that is tuned to generate object masks for the identified one or more categories of objects. To illustrate, in some embodiments, based on determining that the one or more identified categories of objects include humans or people, the scene-based image editing system 106 utilizes a special human object masking neural network to generate an object mask, such as the object mask 324 shown in fig. 3.
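The class-based selection described above can be illustrated with the following sketch, in which the registry of specialized masking models and the placeholder callables are hypothetical and merely stand in for trained networks.

    def select_mask_model(object_label, specialized_models, default_model):
        # Return a class-specialized masking model when one exists for the label,
        # otherwise fall back to the generic segmentation model.
        return specialized_models.get(object_label, default_model)

    # Placeholder callables standing in for trained masking networks.
    generic_masker = lambda image, box: f"generic mask for box {box}"
    human_masker = lambda image, box: f"human-specialized mask for box {box}"

    masker = select_mask_model("person", {"person": human_masker}, generic_masker)
    print(masker(image=None, box=(10, 20, 110, 220)))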
As further shown in fig. 3, the scene-based image editing system 106 receives object masks 324 and 326 as output from the object segmentation machine learning model 310. As previously described, in one or more embodiments, the object mask includes a pixel-level mask corresponding to an object in the source or target digital image. In one example, object masking includes a prediction edge indicating one or more objects and a segmentation boundary of pixels contained within the prediction edge.
In some embodiments, the scene-based image editing system 106 also detects the objects shown in the digital image 316 via the collective network, i.e., the detection masking neural network 300, in the same manner as described above. For example, in some cases, the scene-based image editing system 106 detects the woman, man, and bird within the digital image 316 via the detection masking neural network 300. Specifically, the scene-based image editing system 106 utilizes feature pyramids and feature maps to identify objects within the digital image 316 via the detection head 304, and generates object masks via the masking head 306.
Further, in one or more implementations, although FIG. 3 illustrates generating object masks based on the user input 312, the scene-based image editing system 106 generates object masks without the user input 312. Specifically, the scene-based image editing system 106 generates object masks for all detected objects within the digital image 316. To illustrate, in at least one implementation, the scene-based image editing system 106 generates object masks for the woman, man, and bird despite not receiving the user input 312.
In one or more embodiments, the scene-based image editing system 106 implements object-aware image editing by generating a content fill for each object depicted in the digital image (e.g., for each object mask corresponding to a depicted object) using a hole-filling model. Specifically, in some cases, the scene-based image editing system 106 utilizes a machine learning model, such as a content-aware hole-filling machine learning model, to generate the content fill(s) for each foreground object. FIGS. 4-6 illustrate a content-aware hole-filling machine learning model that the scene-based image editing system 106 uses to generate content fills for objects in accordance with one or more embodiments.
In one or more embodiments, the content fill includes a set of pixels generated for replacing another set of pixels of a digital image. Indeed, in some embodiments, the content fill includes a set of replacement pixels for replacing another set of pixels. For example, in some embodiments, the content fill includes a set of pixels generated to fill a hole (e.g., a content blank) that remains after (or if) a set of pixels (e.g., a set of pixels depicting an object) has been removed from or moved within the digital image. In some cases, the content fill corresponds to a background of the digital image. To illustrate, in some implementations, the content fill includes a set of pixels generated to blend with the portion of the background near an object that may be moved/removed. In some cases, the content fill includes a repair segment, such as a repair segment generated from other pixels (e.g., other background pixels) within the digital image. In some cases, the content fill includes other content (e.g., arbitrarily selected content or content selected by a user) to fill a hole or replace another set of pixels.
In one or more embodiments, the content-aware hole-filling machine learning model includes a computer-implemented machine learning model that generates content fills. In particular, in some embodiments, the content-aware hole-filling machine learning model includes a computer-implemented machine learning model that generates content fills for replacement regions in a digital image. For example, in some cases, the scene-based image editing system 106 determines that an object has been moved within or removed from a digital image and utilizes a content-aware hole-filling machine learning model to generate a content fill for the hole exposed as a result of the movement/removal. However, as will be discussed in more detail, in some implementations, the scene-based image editing system 106 anticipates the movement or removal of an object and utilizes the content-aware hole-filling machine learning model to pre-generate a content fill for the object. In some cases, the content-aware hole-filling machine learning model includes a neural network, such as a repair neural network (e.g., a neural network that uses other pixels of the digital image to generate a content fill, more specifically a repair segment). In other words, in various implementations, the scene-based image editing system 106 utilizes the content-aware hole-filling machine learning model to provide content at a location of the digital image that does not initially depict such content (e.g., because the location is occupied by another semantic region, such as an object).
FIG. 4 illustrates a scene-based image editing system 106 generating a repair digital image 408 from a digital image 402 having a replacement region 404 using a content-aware machine learning model such as a cascade modulation repair neural network (cascaded modulation inpainting neural network) 420, in accordance with one or more embodiments.
In fact, in one or more embodiments, the replacement region 404 includes a region corresponding to an object (and the hole that would exist if the object were moved or deleted). In some embodiments, the scene-based image editing system 106 identifies the replacement region 404 based on a user's selection of pixels (e.g., pixels depicting an object) to be moved, removed, overlaid, or replaced within the digital image. To illustrate, in some cases, a client device selects an object depicted in the digital image. Thus, the scene-based image editing system 106 deletes or removes the object and generates replacement pixels. In some cases, the scene-based image editing system 106 identifies the replacement region 404 by generating an object mask via a segmented neural network. For example, the scene-based image editing system 106 utilizes a segmented neural network (e.g., the detection masking neural network 300 discussed above with reference to FIG. 3) to detect an object within the digital image and generate an object mask for the object. Thus, in some implementations, the scene-based image editing system 106 generates a content fill for the replacement region 404 before receiving user input to move, remove, overlay, or replace the pixels that originally occupied the replacement region 404.
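The following sketch outlines how content fills could be pre-generated so that a later deletion only reveals pixels that already exist; the segmenter and inpainter interfaces and the NumPy-style boolean masks are assumptions for the example rather than the disclosed implementation.

    def preprocess_for_object_aware_editing(image, segmenter, inpainter):
        # Segment every object up front and pre-generate a content fill behind each
        # one, before any user request to move or delete the object arrives.
        prepared = []
        for obj in segmenter(image):              # each obj exposes .label and .mask
            fill = inpainter(image, obj.mask)     # plausible background pixels under the object
            prepared.append({"label": obj.label, "mask": obj.mask, "fill": fill})
        return prepared

    def delete_object(image, entry):
        # Deleting an object becomes a cheap composite: copy the pre-generated fill
        # into the pixels covered by the object mask (boolean NumPy mask assumed).
        edited = image.copy()
        edited[entry["mask"]] = entry["fill"][entry["mask"]]
        return edited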
As shown, the scene-based image editing system 106 utilizes the cascade modulation repair neural network 420 to generate replacement pixels for the replacement region 404. In one or more embodiments, the cascade modulation repair neural network 420 includes a generative adversarial neural network for generating replacement pixels. In some embodiments, a generative adversarial neural network (or "GAN") includes a neural network that is tuned or trained via an adversarial process to generate an output digital image (e.g., from an input digital image). In some cases, the generative adversarial neural network includes a plurality of constituent neural networks, such as an encoder neural network and one or more decoder/generator neural networks. For example, the encoder neural network extracts a latent code from a noise vector or a digital image. The generator neural network (or a combination of generator neural networks) generates a modified digital image by combining the extracted latent codes (e.g., from the encoder neural network). During training, a discriminator neural network competes with the generator neural network and analyzes generated digital images to produce authenticity predictions by determining whether a generated digital image is real (e.g., from a set of stored digital images) or fake (e.g., not from the set of stored digital images). The discriminator neural network also enables the scene-based image editing system 106 to modify parameters of the encoder neural network and/or the one or more generator neural networks to ultimately generate digital images that fool the discriminator neural network into indicating that a generated digital image is a real digital image.
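For background, the sketch below shows a generic adversarial training step for an inpainting generator and discriminator; the loss terms, update order, and model signatures are common-practice assumptions and not the training procedure of the disclosed system.

    import torch
    import torch.nn.functional as F

    def adversarial_step(generator, discriminator, g_opt, d_opt, real_images, masks):
        # Discriminator update: label real images 1 and generated images 0.
        with torch.no_grad():
            fakes = generator(real_images * (1 - masks), masks)
        d_real = discriminator(real_images)
        d_fake = discriminator(fakes)
        d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
                  + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

        # Generator update: try to make the discriminator predict "real" for fakes.
        fakes = generator(real_images * (1 - masks), masks)
        d_fake = discriminator(fakes)
        g_loss = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()
        return d_loss.item(), g_loss.item()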
Along these lines, a generative adversarial neural network refers to a neural network having a specific architecture or serving a specific purpose, such as a generative repair neural network. For example, a generative repair neural network includes a generative adversarial neural network that repairs or fills in pixels of a digital image with a content fill (or generates a content fill in anticipation of repairing or filling in pixels of the digital image). In some cases, the generative repair neural network repairs the digital image by filling in hole regions (indicated by object masks). Indeed, as described above, in some embodiments, an object mask defines a replacement region using a segmentation or mask that indicates, overlaps, overlays, or delineates the pixels to be removed or replaced within the digital image.
Thus, in some embodiments, the cascade modulation repair neural network 420 includes a generative repair neural network that utilizes a decoder having one or more cascade modulation decoder layers. In effect, as shown in FIG. 4, the cascade modulation repair neural network 420 includes a plurality of cascade modulation decoder layers 410, 412, 414, 416. In some cases, a cascade modulation decoder layer includes at least two connected (e.g., cascaded) modulation blocks for modulating an input signal when generating the repair digital image. To illustrate, in some cases, the cascade modulation decoder layer includes a first global modulation block and a second global modulation block. Similarly, in some cases, the cascade modulation decoder layer includes a first global modulation block (which analyzes global features and utilizes a spatially invariant approach) and a second spatial modulation block (which analyzes local features utilizing a spatially varying approach). Additional details regarding the modulation blocks are provided below (e.g., with regard to FIGS. 5-6).
As shown, scene-based image editing system 106 utilizes a cascade modulation repair neural network 420 (and cascade modulation decoder layers 410, 412, 414, 416) to generate repair digital image 408. In particular, cascade modulation repair neural network 420 generates repair digital image 408 by generating a content fill for replacement region 404. As shown, the replacement area 404 is now filled with content fills having replacement pixels that depict photo-level realism (photo-realistic) scenes that replace the replacement area 404.
As described above, the scene-based image editing system 106 generates a repair digital image using a cascade modulation repair neural network that includes a cascade modulation decoder layer. FIG. 5 illustrates an example architecture of a cascaded modulation repair neural network 502 in accordance with one or more embodiments.
As shown, the cascade modulation repair neural network 502 includes an encoder 504 and a decoder 506. In particular, encoder 504 includes a plurality of convolutional layers 508a-508n of different scales/resolutions. In some cases, the scene-based image editing system 106 feeds a digital image input 510 (e.g., encoding of a digital image) to the first convolution layer 508A to generate encoded feature vectors of a higher scale (e.g., lower resolution). The second convolution layer 508b processes the encoded feature vectors at a higher scale (lower resolution) and generates additional encoded feature vectors (at another higher scale/lower resolution). The cascade modulation repair neural network 502 iteratively generates these encoded feature vectors until a final/highest scale convolution layer 508n is reached and generates a final encoded feature vector representation of the digital image.
As shown, in one or more embodiments, the cascade modulation repair neural network 502 generates a global signature (global feature code) from the final encoded feature vector of the encoder 504. Global feature codes include characteristic representations of digital images from a global (e.g., high-level, high-scale, low-resolution) perspective. In particular, the global feature code comprises a representation of a digital image reflecting the coded feature vector at the highest scale/lowest resolution (or the different coded feature vectors satisfying the threshold scale/resolution).
As shown, in one or more embodiments, the cascade modulation repair neural network 502 applies a neural network layer (e.g., a fully connected layer) to the final encoded feature vector to generate a style code 512 (e.g., a style vector). In addition, the cascade modulation repair neural network 502 generates the global feature code by combining the style code 512 with a random style code 514. Specifically, the cascade modulation repair neural network 502 generates the random style code 514 by processing an input noise vector with a neural network layer (e.g., a multi-layer perceptron). The neural network layer maps the input noise vector to the random style code 514. The cascade modulation repair neural network 502 combines (e.g., concatenates, adds, or multiplies) the random style code 514 with the style code 512 to generate the global feature code 516. Although FIG. 5 illustrates a particular approach to generating the global feature code 516, the scene-based image editing system 106 is able to utilize a variety of different approaches (e.g., without the style code 512 and/or the random style code 514) to generate a global feature code that represents the encoded feature vectors of the encoder 504.
As described above, in some embodiments, the cascade modulation repair neural network 502 generates an image encoding using the encoder 504. An image encoding refers to an encoded representation of the digital image. Thus, in some cases, the image encoding includes one or more encoded feature vectors, a style code, and/or a global feature code.
In one or more embodiments, the cascade modulation repair neural network 502 utilizes a plurality of Fourier convolutional encoder layers to generate the image encoding (e.g., the encoded feature vectors, the style code 512, and/or the global feature code 516). For example, a Fourier convolutional encoder layer (or fast Fourier convolution) includes a convolutional layer that includes non-local receptive fields and cross-scale fusion within the convolutional unit. In particular, a fast Fourier convolution can include three kinds of computation in a single operation unit: a local branch that performs small-kernel convolution, a semi-global branch that processes spectrally stacked image patches, and a global branch that manipulates the image-level spectrum. These three branches complement one another by attending to different scales. Furthermore, in some cases, the fast Fourier convolution includes a multi-branch aggregation process for cross-scale fusion. For example, in one or more embodiments, the cascade modulation repair neural network 502 uses a fast Fourier convolution layer as described by Lu Chi, Borui Jiang, and Yadong Mu in "Fast Fourier Convolution," Advances in Neural Information Processing Systems 33 (2020), which is incorporated herein by reference in its entirety.
Specifically, in one or more embodiments, the cascade modulation repair neural network 502 uses a Fourier convolutional encoding layer for each of the encoder convolutional layers 508a-508n. Thus, the cascade modulation repair neural network 502 utilizes different Fourier convolutional encoding layers with different scales/resolutions to generate encoded feature vectors with improved non-local receptive fields.
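A simplified sketch of a fast Fourier convolution layer appears below; it keeps only a local branch and a global spectral branch and omits the channel split and cross-scale fusion of Chi et al., so it is an approximation for illustration rather than the encoder layer used by the cascade modulation repair neural network 502.

    import torch
    import torch.nn as nn

    class SpectralBranch(nn.Module):
        # Global branch: convolve in the frequency domain so every output location
        # has a receptive field covering the entire image.
        def __init__(self, channels):
            super().__init__()
            self.freq_conv = nn.Sequential(
                nn.Conv2d(channels * 2, channels * 2, kernel_size=1),
                nn.BatchNorm2d(channels * 2),
                nn.ReLU(inplace=True),
            )

        def forward(self, x):
            b, c, h, w = x.shape
            spec = torch.fft.rfft2(x, norm="ortho")            # complex spectrum
            spec = torch.cat([spec.real, spec.imag], dim=1)    # stack real/imaginary parts
            spec = self.freq_conv(spec)
            real, imag = torch.chunk(spec, 2, dim=1)
            return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")

    class SimpleFFC(nn.Module):
        # Minimal Fourier convolution: a local 3x3 branch plus the global spectral branch.
        def __init__(self, channels):
            super().__init__()
            self.local_branch = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.global_branch = SpectralBranch(channels)

        def forward(self, x):
            return torch.relu(self.local_branch(x) + self.global_branch(x))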
The operation of the encoder 504 may also be described in terms of variables or equations to demonstrate the function of the cascade modulation repair neural network 502. For example, as described above, the cascade modulation repair neural network 502 is an encoder-decoder network that uses the proposed cascade modulation blocks for image repair in its decoding stage. Specifically, the cascade modulation repair neural network 502 starts with an encoder $E$ that takes the partial image $x \odot (1-m)$ and the mask $m$ as inputs to produce multi-scale feature maps from the input resolution down to resolution $4 \times 4$:

$$F_e^{(1)}, \ldots, F_e^{(L)} = E\big(x \odot (1-m),\, m\big),$$

where $F_e^{(i)}$ is the feature map generated at scale $1 \le i \le L$ (with $L$ being the highest scale, i.e., the lowest resolution). The encoder is implemented by a set of stride-2 convolutions with residual connections.
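A minimal sketch of such a multi-scale encoder is shown below; the channel counts are assumptions, and the residual connections of the actual encoder are omitted for brevity.

    import torch.nn as nn

    class MultiScaleEncoderSketch(nn.Module):
        # Each stage halves the spatial resolution; every intermediate feature map is
        # kept so the decoder can reuse features at the matching scale.
        def __init__(self, channels=(3, 64, 128, 256, 512)):
            super().__init__()
            self.stages = nn.ModuleList(
                nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU())
                for c_in, c_out in zip(channels[:-1], channels[1:]))

        def forward(self, x):
            features = []
            for stage in self.stages:
                x = stage(x)
                features.append(x)   # F_e^(1), ..., F_e^(L) at increasingly coarse scales
            return features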
After generating the highest scale feature $F_e^{(L)}$, a fully connected layer followed by $\ell_2$ normalization produces a global style code $s = \mathrm{FC}(F_e^{(L)}) / \lVert \mathrm{FC}(F_e^{(L)}) \rVert_2$ that represents the input globally. In parallel with the encoder, an MLP-based mapping network generates a random style code $w$ from a normalized random Gaussian noise $z$, thereby simulating the randomness of the generation process. Furthermore, the scene-based image editing system 106 combines $w$ with $s$ to produce the final global code $g = [s; w]$ for decoding. As described above, in some embodiments, the scene-based image editing system 106 utilizes the final global code as an image encoding for the digital image.
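The assembly of the global code can be sketched as follows, where the dimensions and layer choices are illustrative assumptions rather than the disclosed configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    feat_dim, style_dim, noise_dim = 512, 128, 64
    to_style = nn.Linear(feat_dim, style_dim)                  # fully connected layer on F_e^(L)
    mapping = nn.Sequential(nn.Linear(noise_dim, style_dim),   # MLP mapping network for w
                            nn.ReLU(),
                            nn.Linear(style_dim, style_dim))

    pooled_feature = torch.randn(1, feat_dim)                  # stand-in for the pooled F_e^(L)
    s = F.normalize(to_style(pooled_feature), dim=1)           # style code s with l2 normalization
    w = mapping(torch.randn(1, noise_dim))                     # random style code w from Gaussian z
    g = torch.cat([s, w], dim=1)                               # global code g = [s; w] for decoding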
As described above, in some implementations, a fully convolutional model suffers from slow growth of the effective receptive field, especially at the early stages of the network. Thus, using strided convolutions within the encoder may generate invalid features within the hole region, making feature correction at the decoding stage more challenging. Fast Fourier convolution (FFC) helps early layers obtain a receptive field that covers the entire image. However, conventional systems use FFC only at the bottleneck layer, which is computationally demanding. Moreover, a shallow bottleneck layer cannot effectively capture global semantic features. Thus, in one or more implementations, the scene-based image editing system 106 replaces the convolutional blocks of the encoder layers with FFCs. FFC enables the encoder to propagate features with an image-wide receptive field at an early stage, which addresses the problem of generating invalid features inside the hole and helps improve the results.
As further shown in fig. 5, the cascade modulation repair neural network 502 also includes a decoder 506. As shown, decoder 506 includes a plurality of cascaded modulation layers 520a-520n. The cascade modulation layers 520a-520n process the input features (e.g., the input global feature map and the input local feature map) to generate new features (e.g., a new global feature map and a new local feature map). Specifically, each of the cascaded modulation layers 520a-520n operates at a different scale/resolution. Thus, the first cascaded modulation layer 520a takes input features at a first resolution/scale and generates new features at a lower scale/higher resolution (e.g., through upsampling as part of one or more modulation operations). Similarly, the additional cascaded modulation layers operate at lower scales/higher resolutions until a restored digital image is generated at the output scales/resolutions (e.g., lowest scale/highest resolution).
Further, each of the cascaded modulation layers includes a plurality of modulation blocks. For example, with respect to FIG. 5, the first cascaded modulation layer 520a includes a global modulation block and a spatial modulation block. Specifically, the cascade modulation repair neural network 502 performs global modulation with respect to the input features of the global modulation block. In addition, the cascade modulation repair neural network 502 performs spatial modulation with respect to the input features of the spatial modulation block. By performing global and spatial modulation within each cascaded modulation layer, the scene-based image editing system 106 refines the global structure with local details to generate a more accurate repair digital image.
As shown, the cascaded modulation layers 520a-520n are cascaded in that the global modulation block is fed into the spatial modulation block. Specifically, the cascade modulation repair neural network 502 performs spatial modulation at a spatial modulation block based on the features generated at the global modulation block. To illustrate, in one or more embodiments, the cascade modulation repair neural network 502 utilizes global modulation blocks to generate intermediate features. The cascade modulation repair neural network 502 also utilizes a convolutional layer (e.g., a 2-layer convolutional affine parameter network) to convert the intermediate features into spatial tensors. The cascade modulation repair neural network 502 utilizes spatial tensors to modulate the input features analyzed by the spatial modulation block.
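The cascade described above can be summarized with the following sketch, in which global_block, apn, and spatial_block are assumed callables standing in for the components of the cascade modulation layer.

    def cascaded_modulation_layer(global_feats, local_feats, global_code,
                                  global_block, apn, spatial_block):
        # The global block runs first; its intermediate output is converted into a
        # spatial tensor by a 2-layer convolutional affine parameter network (APN),
        # and that tensor then modulates the local branch.
        new_global_feats, global_intermediate = global_block(global_feats, global_code)
        spatial_tensor = apn(global_intermediate)
        new_local_feats = spatial_block(local_feats, spatial_tensor, global_code)
        return new_global_feats, new_local_feats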
For example, fig. 6 provides additional details regarding the operation of the global and spatial modulation blocks in accordance with one or more embodiments. Specifically, fig. 6 shows a global modulation block 602 and a spatial modulation block 603. As shown in fig. 6, the global modulation block 602 includes a first global modulation operation 604 and a second global modulation operation 606. In addition, spatial modulation block 603 includes global modulation operation 608 and spatial modulation operation 610.
For example, a modulation block (or modulation operation) includes a computer-implemented process for modulating (e.g., scaling or shifting) an input signal according to one or more conditions. To illustrate, the modulation block amplifies certain features while counteracting/normalizing these amplifications to maintain operation within the generative model. Thus, for example, in some cases, the modulation block (or modulation operation) includes a modulation layer, a convolutional layer, and a normalization layer. The modulation layer scales each input feature of the convolution, and the normalization layer removes the effect of the scaling from the statistics of the convolution's output feature maps.
In practice, because the modulation layer modifies the feature statistics, the modulation block (or modulation operation) typically includes one or more approaches for accounting for these statistical changes. For example, in some cases, the modulation block (or modulation operation) includes a computer-implemented process that normalizes the features using batch normalization or instance normalization. In some embodiments, the modulation is achieved by scaling and shifting the normalized activations according to affine parameters predicted from input conditions. Similarly, some modulation procedures replace feature normalization with a demodulation process. Thus, in one or more embodiments, the modulation block (or modulation operation) includes a modulation layer, a convolutional layer, and a demodulation layer. For example, in one or more embodiments, the modulation block (or modulation operation) includes the modulation approach described by Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila in "Analyzing and Improving the Image Quality of StyleGAN," CVPR (2020) (hereinafter StyleGAN2), which is incorporated herein by reference in its entirety. In some cases, the modulation block includes one or more modulation operations.
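For reference, a generic sketch of a StyleGAN2-style modulation-convolution-demodulation operation is shown below; it follows the weight-demodulation formulation of Karras et al. rather than the exact block used by the scene-based image editing system.

    import torch
    import torch.nn.functional as F

    def modulated_conv2d(x, weight, style, eps=1e-8):
        # x: (b, in_c, h, w); weight: (out_c, in_c, k, k); style: (b, in_c)
        b, in_c, h, w = x.shape
        out_c = weight.shape[0]
        w_mod = weight.unsqueeze(0) * style.view(b, 1, in_c, 1, 1)      # modulate input channels
        demod = torch.rsqrt(w_mod.pow(2).sum(dim=[2, 3, 4]) + eps)      # per-output-channel demodulation
        w_mod = w_mod * demod.view(b, out_c, 1, 1, 1)
        # grouped convolution applies a different modulated kernel to each batch item
        x = x.reshape(1, b * in_c, h, w)
        w_mod = w_mod.reshape(b * out_c, in_c, *weight.shape[2:])
        out = F.conv2d(x, w_mod, padding=weight.shape[-1] // 2, groups=b)
        return out.reshape(b, out_c, h, w)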
Further, in one or more embodiments, the global modulation block (or global modulation operation) comprises a modulation block (or modulation operation) that modulates the input signal in a spatially invariant manner. For example, in some embodiments, the global modulation block (or global modulation operation) performs modulation according to global features of the digital image (e.g., without spatially varying across the coordinates of the feature map or image). Thus, for example, a global modulation block comprises a modulation block that modulates an input signal according to an image code (e.g., global feature code) generated by an encoder. In some implementations, the global modulation block includes a plurality of global modulation operations.
In one or more embodiments, the spatial modulation block (or spatial modulation operation) includes a modulation block (or modulation operation) that modulates the input signal in a spatially varying manner (e.g., according to a spatially varying signature). In particular, in some embodiments, the spatial modulation block (or spatial modulation operation) modulates the input signal in a spatially varying manner using a spatial tensor. Thus, in one or more embodiments, the global modulation block applies global modulation in which affine parameters are consistent in spatial coordinates, and the spatial modulation block applies spatially varying affine transforms that vary in spatial coordinates. In some embodiments, the spatial modulation block includes spatial modulation operations (e.g., global modulation operations and spatial modulation operations) combined with another modulation operation.
For example, in some embodiments, the spatial modulation operations include spatially adaptive modulation as described by Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu in "Semantic Image Synthesis with Spatially-Adaptive Normalization," CVPR (2019) (hereinafter Taesung), which is incorporated herein by reference in its entirety. In some embodiments, however, the spatial modulation operation utilizes a different architecture than Taesung, including a modulation-convolution-demodulation pipeline.
Thus, with respect to FIG. 6, scene-based image editing system 106 utilizes global modulation block 602. As shown, the global modulation block 602 includes a first global modulation operation 604 and a second global modulation operation 606. Specifically, the first global modulation operation 604 processes the global feature map 612. For example, global feature map 612 includes feature vectors generated by cascade modulation repair neural networks that reflect global features (e.g., high-level features or features corresponding to an entire digital image). Thus, for example, global feature map 612 includes feature vectors reflecting global features generated from previous global modulation blocks of the concatenated decoder layer. In some cases, global feature map 612 also includes feature vectors corresponding to the encoded feature vectors generated by the encoder (e.g., at the first decoder layer, in various implementations, scene-based image editing system 106 utilizes encoded feature vectors, style codes, global feature codes, constants, noise vectors, or other feature vectors as inputs).
As shown, the first global modulation operation 604 includes a modulation layer 604a, an upsampling layer 604b, a convolutional layer 604c, and a normalization layer 604d. In particular, the scene-based image editing system 106 utilizes the modulation layer 604a to perform a global modulation of the global feature map 612 based on the global feature code 614 (e.g., the global feature code 516). In particular, the scene-based image editing system 106 applies a neural network layer (i.e., a fully connected layer) to the global feature code 614 to generate a global feature vector 616. The scene-based image editing system 106 then modulates the global feature map 612 with the global feature vector 616.
In addition, the scene-based image editing system 106 applies the upsampling layer 604b (e.g., to modify the resolution scale). In addition, scene-based image editing system 106 applies convolutional layer 604c. In addition, the scene-based image editing system 106 applies the normalization layer 604d to complete the first global modulation operation 604. As shown, the first global modulation operation 604 generates global intermediate features 618. In particular, in one or more embodiments, the scene-based image editing system 106 generates the global intermediate feature 618 by combining (e.g., concatenating) the output of the first global modulation operation 604 with the encoded feature vector 620 (e.g., from a convolutional layer of an encoder with a matching scale/resolution).
As shown, the scene-based image editing system 106 also utilizes a second global modulation operation 606. In particular, scene-based image editing system 106 applies second global modulation operation 606 to global intermediate feature 618 to generate new global feature map 622. In particular, scene-based image editing system 106 applies global modulation layer 606a to global intermediate features 618 (e.g., conditional on global feature vector 616). In addition, scene-based image editing system 106 applies convolution layer 606b and normalization layer 606c to generate new global feature map 622. As shown, in some embodiments, scene-based image editing system 106 applies spatial bias when generating new global feature map 622.
Further, as shown in FIG. 6, the scene-based image editing system 106 utilizes the spatial modulation block 603. Specifically, the spatial modulation block 603 includes a global modulation operation 608 and a spatial modulation operation 610. The global modulation operation 608 processes the local feature map 624. For example, the local feature map 624 includes feature vectors generated by the cascade modulation repair neural network that reflect local features (e.g., low-level, specific, or spatially varying features). Thus, for example, the local feature map 624 includes feature vectors reflecting local features generated from the previous spatial modulation block of the preceding cascaded decoder layer. In some cases, the local feature map 624 also includes feature vectors corresponding to the encoded feature vectors generated by the encoder (e.g., at the first decoder layer, in various implementations, the scene-based image editing system 106 utilizes encoded feature vectors, style codes, noise vectors, or other feature vectors).
As shown, scene-based image editing system 106 generates local intermediate features 626 from local feature map 624 using global modulation operation 608. Specifically, scene-based image editing system 106 applies modulation layer 608a, upsampling layer 608b, convolution layer 608c, and normalization layer 608d. Furthermore, in some embodiments, scene-based image editing system 106 applies spatial bias and broadcast noise to the output of global modulation operation 608 to generate local intermediate features 626.
As shown in fig. 6, scene-based image editing system 106 utilizes spatial modulation operation 610 to generate a new local feature map 628. In effect, spatial modulation operation 610 modulates local intermediate features 626 based on global intermediate features 618. In particular, scene-based image editing system 106 generates spatial tensors 630 from global intermediate features 618. For example, scene-based image editing system 106 applies a convolved affine parameter network to generate spatial tensor 630. In particular, the scene-based image editing system 106 applies a convolved affine parameter network to generate the intermediate space tensor. Scene-based image editing system 106 combines the intermediate spatial tensors with global feature vector 616 to generate spatial tensors 630. The scene-based image editing system 106 modulates the local intermediate features 626 (with the spatial modulation layer 610 a) with the spatial tensor 630 and generates a modulated tensor.
As shown, scene-based image editing system 106 also applies convolutional layer 610b to the modulated tensor. Specifically, convolution layer 610b generates a convolution feature representation from the modulated tensor. In addition, scene-based image editing system 106 applies normalization layer 610c to the convolved feature representations to generate new local feature maps 628.
Although shown as the normalization layer 610c, in one or more embodiments, the scene-based image editing system 106 applies a demodulation layer. For example, the scene-based image editing system 106 applies a modulation-convolution-demodulation pipeline (e.g., a general normalization rather than instance normalization). In some cases, this approach avoids potential artifacts (e.g., water drop artifacts) caused by instance normalization. In effect, the demodulation/normalization layer includes a layer that scales each output feature map by a uniform demodulation/normalization value (e.g., normalizing by a uniform standard deviation rather than using instance normalization, which depends on data values based on the content of the feature maps).
As shown in fig. 6, in some embodiments, scene-based image editing system 106 also applies shift tensor 632 and broadcast noise to the output of spatial modulation operation 610. For example, spatial modulation operation 610 generates normalized/demodulated features. The scene based image editing system 106 also generates a shift tensor 632 by applying the affine parameter network to the global intermediate feature 618. The scene-based image editing system 106 combines the normalized/demodulated features, shift tensors 632, and/or broadcast noise to generate a new local feature map 628.
In one or more embodiments, upon generating the new global feature map 622 and the new local feature map 628, the scene-based image editing system 106 proceeds to the next cascaded modulation layer in the decoder. For example, the scene-based image editing system 106 uses the new global feature map 622 and the new local feature map 628 as input features to additional cascaded modulation layers at different scales/resolutions. The scene-based image editing system 106 also utilizes additional cascaded modulation layers to generate additional feature maps (e.g., utilizing additional global modulation blocks and additional spatial modulation blocks). In some cases, scene-based image editing system 106 iteratively processes feature maps using cascaded modulation layers until a final scale/resolution is reached to generate a restored digital image.
Although fig. 6 shows a global modulation block 602 and a spatial modulation block 603, in some embodiments, scene-based image editing system 106 utilizes a global modulation block followed by another global modulation block. For example, scene-based image editing system 106 replaces spatial modulation block 603 with an additional global modulation block. In such an embodiment, scene-based image editing system 106 replaces the APN (and spatial tensor) and corresponding spatial modulations shown in fig. 6 with a skip connection (skip connection). For example, scene-based image editing system 106 utilizes global intermediate features to perform global modulation with respect to local intermediate vectors. Thus, in some cases, scene-based image editing system 106 utilizes a first global modulation block and a second global modulation block.
As described above, the decoder can also be described in terms of variables and equations to illustrate the operation of the cascade modulation repair neural network. For example, as described above, the decoder stacks a series of cascaded modulation blocks to upsample the input feature map. Each cascade modulation block takes the global code g as input to modulate a feature according to a global representation of the partial image. In addition, in some cases, the scene-based image editing system 106 provides a mechanism to correct local errors after predicting the global structure.
In particular, in some embodiments, the scene-based image editing system 106 utilizes cascaded modulation blocks to address the challenge of generating coherent features both globally and locally. At a high level, the scene-based image editing system 106 follows the following approach: i) decompose the global and local features to separate the local details from the global structure, and ii) predict the local details from the global structure with a cascade of global and spatial modulations. In one or more implementations, the scene-based image editing system 106 utilizes spatial modulation generated from the global code to make better predictions (e.g., and discards instance normalization to keep the design compatible with StyleGAN2).
In particular, the cascaded modulation block takes the global and local feature maps from the previous scale and the global code g as input, and generates new global and local feature maps at the next scale/resolution. To generate the new global feature map from the input global feature map, the scene-based image editing system 106 uses a global code modulation stage that includes a modulation-convolution-demodulation procedure, which generates an upsampled feature X.
Since the global vector g has limited expressive power for representing two-dimensional visual details, and since the features inside and outside the hole are inconsistent, global modulation alone may produce distorted features that are inconsistent with the context. To compensate, in some cases, the scene-based image editing system 106 utilizes spatial modulation to generate more accurate features. Specifically, the spatial modulation uses X as a spatial code and g as a global code, and modulates the input local feature Y in a spatially adaptive manner.
Furthermore, the scene-based image editing system 106 utilizes a unique spatial modulation-demodulation mechanism to avoid potential "water drop" artifacts caused by instance normalization in conventional systems. As shown, the spatial modulation follows a modulation-convolution-demodulation pipeline.
Specifically, for spatial modulation, the scene-based image editing system 106 generates a spatial tensor A_0 = APN(X) from the feature X via a two-layer convolutional Affine Parameter Network (APN). Meanwhile, the scene-based image editing system 106 generates a global vector α = fc(g) from the global code g with a fully connected layer (fc) to capture the global context. The scene-based image editing system 106 generates the final spatial tensor A = A_0 + α as the sum of A_0 and α, which is used to scale the intermediate feature Y of the block with the element-wise product ⊙:

Ȳ = Y ⊙ A

Further, for convolution, the modulated tensor Ȳ is convolved with a 3 × 3 learnable kernel K, with the result:

Ŷ = Ȳ ⊗ K

For spatially-aware demodulation, the scene-based image editing system 106 applies a demodulation step to compute the normalized output Ỹ. Specifically, the scene-based image editing system 106 assumes that the input feature Y consists of independent random variables with unit variance and that, after modulation, the expected variance of the output remains one. This gives the demodulation computation:

Ỹ = Ŷ ⊙ D

where D is the demodulation coefficient. In some cases, the scene-based image editing system 106 implements the foregoing equations using standard tensor operations.
In one or more implementations, the scene-based image editing system 106 also adds a spatial bias and broadcast noise. For example, the scene-based image editing system 106 adds to the normalized feature Ỹ a shift tensor B = APN(X), generated from the feature X by another Affine Parameter Network (APN), together with broadcast noise n, to generate the new local feature map.
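To make the above pipeline concrete, the following PyTorch-style sketch illustrates spatial modulation, convolution, and demodulation over a local feature Y using the spatial code X and global code g. The module names (apn_A, fc_alpha, apn_B) and the kernel argument are assumptions for illustration, and the demodulation coefficient is approximated here by empirically normalizing the convolved features to unit variance rather than by a closed-form coefficient, so the sketch shows the shape of the computation rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def spatial_modulation(Y, X, g, apn_A, fc_alpha, kernel, apn_B, eps=1e-8):
    """Sketch: spatially modulate local feature Y with spatial code X and global code g."""
    A0 = apn_A(X)                                     # spatial tensor A0 = APN(X)
    alpha = fc_alpha(g).unsqueeze(-1).unsqueeze(-1)   # global vector alpha = fc(g)
    A = A0 + alpha                                    # final spatial tensor A = A0 + alpha
    Y_bar = Y * A                                     # modulation via element-wise product
    Y_hat = F.conv2d(Y_bar, kernel, padding=1)        # convolution with a 3x3 learnable kernel K
    # demodulation (approximation): rescale so the output variance stays near one
    D = torch.rsqrt(Y_hat.var(dim=(1, 2, 3), keepdim=True) + eps)
    Y_tilde = Y_hat * D
    B = apn_B(X)                                      # spatial bias from another APN
    noise = torch.randn_like(Y_tilde)                 # broadcast noise
    return Y_tilde + B + noise                        # new local feature map
```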
Thus, in one or more embodiments, to generate a content fill of replacement pixels for a digital image with a replacement region, the scene-based image editing system 106 generates an encoded feature map from the digital image using an encoder of a content-aware hole-filling machine learning model (e.g., a cascade modulation repair neural network). The scene-based image editing system 106 also generates the content fill for the replacement region using the decoder of the content-aware hole-filling machine learning model. In particular, in some embodiments, the scene-based image editing system 106 generates the content fill for the replacement region of the digital image using local and global feature maps from one or more decoding layers of the content-aware hole-filling machine learning model.
As discussed above with reference to fig. 3-6, in one or more embodiments, the scene-based image editing system 106 utilizes a segmented neural network to generate object masks for objects depicted in a digital image, and a content-aware hole-filling machine learning model to generate content fills for those objects (e.g., for the object masks generated for the objects). As further mentioned, in some embodiments, the scene-based image editing system 106 generates the object mask(s) and content fill(s) in anticipation of one or more modifications to the digital image, before user input for such modifications is received. For example, in one or more implementations, upon opening, accessing, or displaying the digital image 706, the scene-based image editing system 106 automatically generates the object mask(s) and content fill(s) (e.g., without user input to do so). Thus, in some implementations, the scene-based image editing system 106 facilitates object-aware modifications of digital images. FIG. 7 shows a diagram of generating object masks and content fills to facilitate object-aware modification of a digital image in accordance with one or more embodiments.
In one or more embodiments, the object-aware modification includes an editing operation that targets an object identified in the digital image. In particular, in some embodiments, the object-aware modification includes an editing operation that targets previously segmented objects. For example, as discussed, in some implementations, prior to receiving user input for modifying an object, the scene-based image editing system 106 generates a mask for the object depicted in the digital image. Thus, when the user selects an object (e.g., the user selects at least some pixels depicting the object), the scene-based image editing system 106 determines to target modification to the entire object, rather than requiring the user to specifically specify each pixel to edit. Thus, in some cases, object-aware modifications include modifications that target an object by managing all pixels that render the object as part of a coherent unit, rather than a single element. For example, in some implementations, object-aware modifications include, but are not limited to, move operations or delete operations.
As shown in fig. 7, the scene-based image editing system 106 utilizes a segmented neural network 702 and a hole-filling machine learning model 704 for content perception to analyze/process digital images 706. The digital image 706 depicts a plurality of objects 708a-708d against a background. Thus, in one or more embodiments, the scene-based image editing system 106 utilizes the segmented neural network 702 to identify objects 708a-708d within the digital image.
In one or more embodiments, the scene-based image editing system 106 utilizes a segmented neural network 702 and a content-aware hole-filling machine learning model 704 to analyze the digital image 706 in anticipation of receiving user input for modification of the digital image 706. Indeed, in some cases, scene-based image editing system 106 analyzes digital image 706 prior to receiving user input for such modification. For example, in some embodiments, the scene-based image editing system 106 automatically analyzes the digital image 706 in response to receiving or otherwise accessing the digital image 706. In some implementations, the scene-based image editing system 106 analyzes the digital image in response to general user input to initiate preprocessing in anticipation of subsequent modifications.
As shown in FIG. 7, the scene-based image editing system 106 utilizes the segmented neural network 702 to generate object masks 710 for objects 708a-708d depicted in the digital image 706. In particular, in some embodiments, the scene-based image editing system 106 utilizes a segmented neural network 702 to generate a separate object mask for each depicted object.
As further shown in FIG. 7, scene-based image editing system 106 utilizes content-aware hole-filling machine learning model 704 to generate content fills 712 for objects 708a-708 d. Specifically, in some embodiments, scene-based image editing system 106 utilizes content-aware hole-filling machine learning model 704 to generate separate content fills for each depicted object. As shown, scene-based image editing system 106 generates content fill 712 using object mask 710. For example, in one or more embodiments, the scene-based image editing system 106 utilizes the object mask 710 generated via the segmented neural network 702 as an indicator of a replacement region to be replaced with the content fill 712 generated by the content-aware hole-fill machine learning model 704. In some cases, scene-based image editing system 106 filters objects from digital image 706 using object mask 710, which results in remaining holes in digital image 706 being filled with content fill 712.
As shown in fig. 7, the scene-based image editing system 106 utilizes the object masks 710 and the content fills 712 to generate a complete background 714. In one or more embodiments, the complete background includes a set of background pixels in which objects are replaced with content fills. In particular, the complete background comprises the background of the digital image with the objects depicted within the digital image replaced by their corresponding content fills. In one or more implementations, generating the complete background includes generating a content fill for each object in the image. Thus, when objects are in front of one another, the complete background may include various levels of completion, such that the background for a first object includes a portion of a second object, and the background of the second object includes semantic regions or the furthest elements in the image.
In effect, FIG. 7 shows a background 716 of the digital image 706 having holes 718a-718d where the objects 708a-708d were depicted. For example, in some cases, the scene-based image editing system 106 filters out the objects 708a-708d using the object masks 710, leaving the holes 718a-718d. In addition, the scene-based image editing system 106 fills the holes 718a-718d with the content fills 712, thereby creating the complete background 714.
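As a simplified illustration of this pre-processing, the Python sketch below assumes a segmentation model that returns one binary mask per detected object and an inpainting model that fills a masked hole; both interfaces and the tensor layout are hypothetical. It generates an object mask and a content fill for every object and composites a complete background with each hole filled.

```python
def preprocess(image, segmentation_model, inpainting_model):
    """Sketch: generate object masks, content fills, and a complete background."""
    object_masks = segmentation_model(image)        # one binary mask per detected object
    content_fills = []
    complete_background = image.clone()
    for mask in object_masks:
        holed = image * (1 - mask)                  # remove the object's pixels
        fill = inpainting_model(holed, mask)        # content-aware hole filling
        content_fills.append(fill)
        # fill the hole left by the object in the background
        complete_background = complete_background * (1 - mask) + fill * mask
    return object_masks, content_fills, complete_background
```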
In other implementations, the scene-based image editing system 106 utilizes the object mask 710 as an indicator of the replacement area in the digital image 706. In particular, the scene-based image editing system 106 utilizes the object mask 710 as an indicator of potential replacement areas that may be generated from receiving user input to modify the digital image 706 by moving/removing one or more of the objects 708a-708d. Thus, scene-based image editing system 106 replaces pixels indicated by object mask 710 with content fill 712.
Although fig. 7 indicates that a separate complete background is generated, it should be appreciated that in some implementations, the scene-based image editing system 106 creates the complete background 714 as part of the digital image 706. For example, in one or more embodiments, the scene-based image editing system 106 positions the content fill 712 behind its corresponding object in the digital image 706 (e.g., as a separate image layer). Furthermore, in one or more embodiments, the scene-based image editing system 106 positions the object mask 710 behind its corresponding object (e.g., as a separate layer). In some implementations, the scene-based image editing system 106 places the content fill 712 behind the object mask 710.
Further, in some implementations, the scene-based image editing system 106 generates a plurality of filled-in (e.g., semi-complete) backgrounds for the digital image. For example, in some cases where the digital image depicts multiple objects, the scene-based image editing system 106 generates a filled background for each object of the multiple objects. To illustrate, the scene-based image editing system 106 generates the filled background for a given object by generating a content fill for that object while treating the other objects of the digital image as part of the background. Thus, in some cases, the content fill includes portions of other objects that are behind the object within the digital image.
Thus, in one or more embodiments, as shown in FIG. 7, the scene-based image editing system 106 generates a combined image 718. In effect, the scene-based image editing system 106 generates the combined image with the digital image 706, the object masks 710, and the content fills 712 as separate layers. Although FIG. 7 shows the object masks 710 over the objects 708a-708d within the combined image 718, it should be appreciated that in various implementations, the scene-based image editing system 106 places the object masks 710 and the content fills 712 behind the objects 708a-708d. Thus, the scene-based image editing system 106 presents the combined image 718 for display within the graphical user interface such that the object masks 710 and the content fills 712 are hidden from view until a user interaction is received that triggers the display of those components.
Further, while fig. 7 shows the combined image 718 separate from the digital image 706, it should be appreciated that in some implementations, the combined image 718 represents a modification to the digital image 706. In other words, in some embodiments, to generate the combined image 718, the scene-based image editing system 106 modifies the digital image 706 by adding additional layers consisting of object masks 710 and content fills 712.
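One way to organize such a layered combined image is a simple layer stack in which each object's pre-generated mask and content fill sit behind the displayed image. The data structure below is an illustrative assumption rather than the system's actual format.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class ObjectLayer:
    object_id: int
    object_mask: Any    # binary mask for the object
    content_fill: Any   # fill placed behind the object (and its mask)

@dataclass
class CombinedImage:
    digital_image: Any                                  # top layer shown to the user
    object_layers: List[ObjectLayer] = field(default_factory=list)

    def layer_for(self, object_id: int) -> ObjectLayer:
        # Look up the hidden layers for an object once it is selected or moved.
        return next(layer for layer in self.object_layers if layer.object_id == object_id)
```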
In one or more embodiments, the scene-based image editing system 106 utilizes the combined image 718 (e.g., the digital image 706, the object masks 710, and the content fills 712) to facilitate various object-aware modifications with respect to the digital image 706. In particular, the scene-based image editing system 106 utilizes the combined image 718 to implement an efficient graphical user interface that facilitates flexible object-aware modifications. Fig. 8A-8D illustrate a graphical user interface implemented by the scene-based image editing system 106 to facilitate move operations in accordance with one or more embodiments.
In effect, as shown in FIG. 8A, the scene-based image editing system 106 provides a graphical user interface 802 for display on a client device 804, such as a mobile device. In addition, the scene-based image editing system 106 provides a digital image 806 for display within the graphical user interface 802.
It should be noted that the graphical user interface 802 of fig. 8A is stylistically simple. In particular, the graphical user interface 802 does not include a number of menus, options, or other visual elements in addition to the digital image 806. While the graphical user interface 802 of fig. 8A does not display menus, options, or other visual elements other than the digital image 806, it should be appreciated that the graphical user interface 802 displays at least some menus, options, or other visual elements in various embodiments—at least when the digital image 806 is initially displayed.
As further shown in FIG. 8A, the digital image 806 depicts a plurality of objects 808A-808d. In one or more embodiments, the scene-based image editing system 106 pre-processes the digital image 806 before receiving user input for a move operation. In particular, in some embodiments, scene-based image editing system 106 utilizes a segmented neural network to detect and generate masking of multiple objects 808a-808d and/or utilizes a hole-filling machine learning model of content perception to generate content fills corresponding to objects 808a-808d. Further, in one or more implementations, scene-based image editing system 106 generates object masking, content filling, and combined images when loading, accessing, or displaying digital image 806, and no user input is required other than to open/display digital image 806.
As shown in fig. 8B, the scene-based image editing system 106 detects a user interaction with the object 808d via the graphical user interface 802. In particular, fig. 8B illustrates the scene-based image editing system 106 detecting a user interaction (e.g., a touch interaction) performed by a user's finger (a portion of a hand 810), although in other embodiments the user interaction is performed with other tools (e.g., a stylus, or a pointer controlled by a mouse or touchpad). In one or more embodiments, the scene-based image editing system 106 determines, based on the user interaction, that the object 808d has been selected for modification.
In various embodiments, the scene-based image editing system 106 detects the user interaction for selecting the object 808d via various operations. For example, in some cases, the scene-based image editing system 106 detects the selection via a single tap (or click) on the object 808d. In some implementations, the scene-based image editing system 106 detects the selection of the object 808d via a double tap (or double click) or a press-and-hold operation. Thus, in some cases, the scene-based image editing system 106 confirms the user's selection of the object 808d with a second tap, click, or hold operation.
In some cases, scene-based image editing system 106 utilizes various interactions to distinguish single-object selections or multi-object selections. For example, in some cases, scene-based image editing system 106 determines a single tap for selecting a single object and a double tap for selecting multiple objects. To illustrate, in some cases, upon receiving a first tap of an object, the scene-based image editing system 106 selects the object. Further, upon receiving the second tap of the object, the scene-based image editing system 106 selects one or more additional objects. For example, in some implementations, the scene-based image editing system 106 selects one or more additional objects having the same or similar classifications (e.g., selects other people depicted in the image when the first tap interacts with a person in the image). In one or more embodiments, if the second tap is received within a threshold period of time after the first tap is received, the scene-based image editing system 106 identifies the second tap as an interaction for selecting the plurality of objects.
In some embodiments, scene-based image editing system 106 identifies other user interactions for selecting a plurality of objects within a digital image. For example, in some implementations, the scene-based image editing system 106 receives a drag action across a display of a digital image and selects all objects captured within the range of the drag action. To illustrate, in some cases, the scene-based image editing system 106 draws a box that grows with drag motion and selects all objects that fall within the box. In some cases, the scene-based image editing system 106 draws a line along the path of the drag motion and selects all objects that are intercepted by the line.
In some implementations, the scene-based image editing system 106 also allows user interaction to select different portions of an object. To illustrate, in some cases, upon receiving a first tap of an object, the scene-based image editing system 106 selects the object. Further, upon receiving the second tap on the object, the scene-based image editing system 106 selects a particular portion of the object (e.g., a limb or torso of a human or a component of a vehicle). In some cases, scene-based image editing system 106 selects the portion of the object touched by the second tap. In some cases, scene-based image editing system 106 enters a "child object" mode upon receiving the second tap and utilizes additional user interactions to select a particular portion of the object.
Returning to FIG. 8B, as shown, based on detecting the user interaction for selecting the object 808d, the scene-based image editing system 106 provides a visual indication 812 associated with the object 808d. Indeed, in one or more embodiments, the scene-based image editing system 106 detects a user interaction with a portion of the object 808d (e.g., with a subset of the pixels depicting the object) and determines that the user interaction targets the object 808d as a whole (rather than the particular pixels with which the user interacted). For example, in some embodiments, the scene-based image editing system 106 utilizes the pre-generated object mask corresponding to the object 808d to determine whether the user interaction targets the object 808d or some other portion of the digital image 806. For example, in some cases, upon detecting a user interaction within the area of the object mask that corresponds to the object 808d, the scene-based image editing system 106 determines that the user interaction targets the object 808d as a whole. Thus, the scene-based image editing system 106 provides the visual indication 812 associated with the object 808d as a whole.
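A straightforward way to resolve a tap into an object-level selection is to hit-test the tapped pixel against the pre-generated object masks, and to expand the selection to similarly classified objects when a second tap arrives within a threshold interval. The sketch below is a hypothetical illustration of that logic; the data structures and the 0.5-second window are assumptions.

```python
import time

TAP_THRESHOLD_SECONDS = 0.5  # assumed double-tap window

def handle_tap(x, y, object_masks, object_classes, selection, last_tap_times):
    """object_masks: {object_id: 2-D binary mask}; object_classes: {object_id: class label}."""
    now = time.time()
    for object_id, mask in object_masks.items():
        if mask[y, x]:  # the tapped pixel falls inside this object's pre-generated mask
            previous = last_tap_times.get(object_id, 0.0)
            if object_id in selection and now - previous < TAP_THRESHOLD_SECONDS:
                # second tap: also select objects with the same classification
                tapped_class = object_classes[object_id]
                selection.update(i for i, c in object_classes.items() if c == tapped_class)
            else:
                selection.add(object_id)  # first tap selects the whole object
            last_tap_times[object_id] = now
            return object_id
    selection.clear()  # tapping the background deselects
    return None
```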
In some cases, scene-based image editing system 106 indicates via graphical user interface 802 that the selection of object 808d has been registered using visual indication 812. In some implementations, the scene-based image editing system 106 utilizes the visual indication 812 to represent a pre-generated object mask corresponding to the object 808 d. Indeed, in one or more embodiments, in response to detecting a user interaction with object 808d, scene-based image editing system 106 presents a corresponding object mask. For example, in some cases, scene-based image editing system 106 presents object masks to prepare modifications to object 808d and/or to indicate that the object masks have been generated and are available for use. In one or more embodiments, rather than using visual indication 812 to represent the presentation of the object mask, scene-based image editing system 106 displays the object mask itself via graphical user interface 802.
Further, because scene-based image editing system 106 generates object masks for object 808d prior to receiving user input selecting object 808d, scene-based image editing system 106 presents visual indication 812 without the latency or delay associated with conventional systems. In other words, scene-based image editing system 106 presents visual indication 812 without any delay associated with generating object masks.
As further shown, based on detecting a user interaction for selecting object 808d, scene-based image editing system 106 provides an options menu 814 for display via graphical user interface 802. The options menu 814 shown in fig. 8B provides a plurality of options, although in various embodiments the options menu includes various numbers of options. For example, in some implementations, the options menu 814 includes one or more select options, such as the option that is determined to be popular or most frequently used. For example, as shown in FIG. 8B, the options menu 814 includes an option 816 to delete object 808 d.
Thus, in one or more embodiments, the scene-based image editing system 106 provides modification options for display via the graphical user interface 802 based on the context of the user interaction. In effect, as just discussed, the scene-based image editing system 106 provides an options menu that offers options for interacting with (e.g., modifying) the selected object. In so doing, the scene-based image editing system 106 minimizes the screen clutter typical of many conventional systems by withholding options or menus from display until it determines that those options or menus would be useful in the current context of user interaction with the digital image. Thus, the graphical user interface 802 used by the scene-based image editing system 106 allows for more flexible implementation on computing devices having relatively limited screen space, such as smartphones or tablet devices.
As shown in fig. 8C, the scene-based image editing system 106 detects an additional user interaction (as shown via an arrow 818) via the graphical user interface 802 for moving the object 808d within the digital image 806. In particular, the scene-based image editing system 106 detects the additional user interaction for moving the object 808d from a first location to a second location in the digital image 806. For example, in some cases, the scene-based image editing system 106 detects the additional user interaction via a drag motion (e.g., a user input that selects the object 808d and moves across the graphical user interface 802 while remaining on the object 808d). In some implementations, following the initial selection of the object 808d, the scene-based image editing system 106 detects the additional user interaction as a click or tap on the second location and determines to use the second location as the new location for the object 808d. It should be noted that the scene-based image editing system 106 moves the object 808d as a whole in response to the additional user interaction.
As shown in fig. 8C, upon moving the object 808d from the first position to the second position, the scene-based image editing system 106 exposes the content fill 820 placed behind the object 808d (e.g., behind the corresponding object mask). In effect, as previously described, the scene-based image editing system 106 places a pre-generated content fill behind the object (or the corresponding object mask) for which the content fill was generated. Thus, upon removal of the object 808d from its initial position within the digital image 806, the scene-based image editing system 106 automatically displays the corresponding content fill. Thus, the scene-based image editing system 106 provides a seamless experience in which objects can be moved without exposing any holes in the digital image itself. In other words, the scene-based image editing system 106 provides the digital image 806 for display as if it were a real scene in which the entire background is known.
Further, because the scene-based image editing system 106 generated the content fill 820 for the object 808d prior to receiving the user input to move the object 808d, the scene-based image editing system 106 exposes or renders the content fill 820 without the latency or delay associated with conventional systems. In other words, as the object 808d moves over the digital image 806, the scene-based image editing system 106 incrementally exposes the content fill 820 without any delay associated with generating content.
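Because the content fill already sits behind the object, moving the object reduces to re-compositing layers: the fill is revealed in the vacated region and the object's pixels are drawn at the new location. The NumPy sketch below illustrates the idea under simplifying assumptions (a single-channel mask, and np.roll for the shift, which wraps at the image borders).

```python
import numpy as np

def move_object(image, mask, content_fill, dx, dy):
    """Sketch: move the masked object by (dx, dy), revealing the pre-generated fill."""
    mask3 = mask[..., None]                                   # (H, W, 1) for broadcasting
    object_pixels = image * mask3                             # pixels belonging to the object
    vacated = image * (1 - mask3) + content_fill * mask3      # expose the fill behind the object
    shifted_mask = np.roll(mask3, shift=(dy, dx), axis=(0, 1))
    shifted_pixels = np.roll(object_pixels, shift=(dy, dx), axis=(0, 1))
    return vacated * (1 - shifted_mask) + shifted_pixels      # composite the object at its new location
```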
As further shown in fig. 8D, the scene-based image editing system 106 deselects the object 808d upon completion of the move operation. In some embodiments, the scene-based image editing system 106 maintains the selection of the object 808d until another user interaction is received (e.g., a user interaction with another portion of the digital image 806) indicating deselection of the object 808d. As further indicated, upon deselection of the object 808d, the scene-based image editing system 106 further removes the previously presented options menu 814. Thus, the scene-based image editing system 106 dynamically presents options for interacting with objects for display via the graphical user interface 802 to maintain a reduced display footprint that does not overwhelm computing devices having limited screen space.
FIGS. 9A-9C illustrate a graphical user interface implemented by the scene-based image editing system 106 to facilitate deletion operations in accordance with one or more embodiments. In effect, as shown in FIG. 9A, the scene-based image editing system 106 provides a graphical user interface 902 for display on a client device 904 and provides a digital image 906 for display in the graphical user interface 902.
As further shown in fig. 9B, the scene-based image editing system 106 detects a user interaction with an object 908 depicted in the digital image 906 via the graphical user interface 902. In response to detecting the user interaction, the scene-based image editing system 106 presents the corresponding object mask, provides a visual indication 910 (or the object mask itself) for display in association with the object 908, and provides an options menu 912 for display. Specifically, as shown, the options menu 912 includes an option 914 for deleting the selected object 908.
Further, as shown in fig. 9C, the scene-based image editing system 106 removes the object 908 from the digital image 906. For example, in some cases, the scene-based image editing system 106 detects an additional user interaction (e.g., an interaction with the option 914 for deleting the object 908) via the graphical user interface 902 and, in response, removes the object 908 from the digital image 906. As further shown, upon removal of the object 908 from the digital image 906, the scene-based image editing system 106 automatically exposes the content fill 916 that was previously placed behind the object 908 (e.g., behind the corresponding object mask). Thus, in one or more embodiments, the scene-based image editing system 106 provides the content fill 916 for immediate display upon removal of the object 908.
Although fig. 8B, 8C, and 9B illustrate a scene-based image editing system 106 providing a menu, in one or more implementations, the scene-based image editing system 106 allows object-based editing without requiring or utilizing a menu. For example, the scene-based image editing system 106 selects the objects 808d, 908 and presents visual indications 812, 910 in response to a first user interaction (e.g., tapping on the respective objects). The scene based image editing system 106 performs object based editing of the digital image in response to the second user interaction without using a menu. For example, in response to a second user input dragging an object across an image, the scene-based image editing system 106 moves the object. Alternatively, in response to a second user input (e.g., a second tap), scene-based image editing system 106 deletes the object.
The scene-based image editing system 106 provides greater flexibility in editing digital images than conventional systems. In particular, scene-based image editing system 106 facilitates object-aware modifications that are capable of interacting with objects without targeting underlying pixels. In fact, based on the selection of some pixels that help to render an object, the scene-based image editing system 106 flexibly determines that the entire object has been selected. This is in contrast to conventional systems that require a user to select an option from a menu indicating an intent to select an object, provide a second user input indicating an object to select (e.g., a bounding box for the object or a drawing for another general boundary of the object), and another user input to generate an object mask. Instead, the scene-based image editing system 106 provides for selecting objects with a single user input (tapping on the object).
Furthermore, upon user interaction to effect the modification after the previous selection, scene-based image editing system 106 applies the modification to the entire object, rather than the particular set of pixels selected. Thus, the scene-based image editing system 106 manages objects within the digital image as objects of a real scene, which are interactive and can be handled as a coherent unit. Furthermore, as discussed, by flexibly and dynamically managing the amount of content displayed on a graphical user interface in addition to digital images, the scene-based image editing system 106 provides improved flexibility with respect to deployment on smaller devices.
Furthermore, the scene-based image editing system 106 provides improved efficiency over many conventional systems. In fact, as previously mentioned, conventional systems typically require execution of a workflow consisting of a sequence of user interactions to perform the modification. In the case where modifications are intended to target a particular object, many of these systems require several user interactions to merely indicate that the object is the subject of a subsequent modification (e.g., a user interaction to identify the object and separate the object from the rest of the image), and a user interaction to close a loop of the performed modification (e.g., filling a hole remaining after removal of the object). However, the scene-based image editing system 106 reduces the user interaction typically required for modification by preprocessing the digital image prior to receiving user input for such modification. In fact, by automatically generating object masks and content fills, the scene-based image editing system 106 eliminates the need for user interaction to perform these steps.
In one or more embodiments, the scene-based image editing system 106 performs further processing of the digital image in anticipation of modifying the digital image. For example, as previously described, in some implementations, the scene-based image editing system 106 generates a semantic scene graph from a digital image. Thus, in some cases, upon receiving one or more user interactions for modifying the digital image, the scene-based image editing system 106 performs the modification using the semantic scene graph. Indeed, in many cases, the scene-based image editing system 106 generates the semantic scene graph prior to receiving user input for modifying the digital image. FIGS. 10-15 illustrate diagrams for generating a semantic scene graph for a digital image in accordance with one or more embodiments.
In fact, many conventional systems are inflexible in that they typically wait for user interaction before determining the characteristics of a digital image. For example, such conventional systems typically wait for a user interaction indicating a characteristic to be determined, and then perform the corresponding analysis in response to receiving the user interaction. Thus, under these systems, useful characteristics are not readily available. For example, upon receiving a user interaction for modifying a digital image, conventional systems typically must perform an analysis of the digital image to determine the characteristics to be changed, and can do so only after the user interaction has been received.
Furthermore, as previously mentioned, such operations result in inefficient operation, as image editing typically requires a workflow of user interactions, many of which are used to determine the characteristics used in the execution of the modification. Thus, conventional systems typically require a significant amount of user interaction to determine the characteristics required for editing.
The scene-based image editing system 106 provides advantages by generating a semantic scene graph for a digital image in anticipation of modifications to the digital image. Indeed, by generating a semantic scene graph, the scene-based image editing system 106 increases flexibility relative to conventional systems because it makes digital image characteristics readily available for use in the image editing process. Furthermore, the scene-based image editing system 106 provides improved efficiency by reducing the user interactions required to determine these characteristics. In other words, the scene-based image editing system 106 eliminates user interactions often required in the preparatory steps of editing digital images under conventional systems. Thus, the scene-based image editing system 106 allows user interactions to be directed more at the image editing itself.
Furthermore, by generating a semantic scene graph for the digital image, the scene-based image editing system 106 intelligently generates/obtains information that allows editing the image as a real world scene. For example, the scene-based image editing system 106 generates a scene graph indicating objects, object properties, object relationships, etc., which allows the scene-based image editing system 106 to implement object/scene-based image editing.
In one or more embodiments, the semantic scene graph includes a graphical representation of a digital image. In particular, in some embodiments, the semantic scene graph includes a graph that maps out characteristics of the digital image and its associated characteristic attributes. For example, in some implementations, the semantic scene graph includes a node graph having nodes representing characteristics of the digital image and values associated with nodes representing characteristic attributes of those characteristics. Furthermore, in some cases, edges between nodes represent relationships between characteristics.
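As a concrete (hypothetical) illustration of such a structure, the sketch below builds a small semantic scene graph with the networkx library: nodes carry characteristic attributes, and edges carry relationships between characteristics. The node names, attribute values, and relationship labels are example assumptions rather than the system's actual schema.

```python
import networkx as nx

scene_graph = nx.DiGraph()

# nodes represent characteristics of the digital image; node values represent characteristic attributes
scene_graph.add_node("scene", setting="indoors", season="unknown", time="day")
scene_graph.add_node("object_1", category="table", object_mask="mask_1", bounding_box=(40, 60, 180, 220))
scene_graph.add_node("object_2", category="vase", object_mask="mask_2", bounding_box=(90, 20, 130, 70))

# edges represent relationships between characteristics
scene_graph.add_edge("object_1", "scene", relationship="part of")
scene_graph.add_edge("object_2", "object_1", relationship="supported by")
```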
As described above, in one or more implementations, the scene-based image editing system 106 utilizes one or more predetermined or pre-generated template maps in generating a semantic scene graph for a digital image. For example, in some cases, scene-based image editing system 106 utilizes an image analysis map in generating a semantic scene map. FIG. 10 illustrates an image analysis diagram 1000 used by the scene-based image editing system 106 in generating a semantic scene graph in accordance with one or more embodiments.
In one or more embodiments, the image analysis graph includes a template graph for constructing a semantic scene graph. Specifically, in some embodiments, the image analysis map includes a template map used by scene-based image editing system 106 to organize information included in the semantic scene map. For example, in some implementations, the image analysis graph includes a template graph of nodes that indicate how to organize a semantic scene graph that represents characteristics of the digital image. In some cases, the image analysis map additionally or alternatively indicates information to be represented in the semantic scene map. For example, in some cases, the image analysis graph indicates characteristics, relationships, and characteristic properties of digital images to be represented in the semantic scene graph.
In fact, as shown in FIG. 10, the image analysis graph 1000 includes a plurality of nodes 1004a-1004g. Specifically, the plurality of nodes 1004a-1004g correspond to characteristics of the digital image. For example, in some cases, the plurality of nodes 1004a-1004g represent a category of characteristics to be determined when analyzing a digital image. In fact, as shown, the image analysis map 1000 indicates that the semantic scene map will represent objects and groups of objects within the digital image, as well as the scene of the digital image, including light sources, settings, and specific locations.
As further shown in FIG. 10, the image analysis graph 1000 includes an organization of a plurality of nodes 1004a-1004g. In particular, the image analysis graph 1000 includes edges 1006a-1006h arranged in a manner that organizes a plurality of nodes 1004a-1004g. In other words, the image analysis chart 1000 shows the relationship between the characteristic categories included therein. For example, the image analysis graph 1000 indicates that the object class represented by node 1004f is closely related to the object group class represented by node 1004g, both depicting objects depicted in the digital image.
Further, as shown in FIG. 10, the image analysis graph 1000 associates property attributes with one or more of the nodes 1004a-1004g to represent property attributes of corresponding property categories. For example, as shown, the image analysis graph 1000 associates a seasonal attribute 1008a and a temporal attribute 1008b with a setting category represented by a node 1004 c. In other words, the image analysis chart 1000 indicates that seasons and times should be determined when determining the settings of the digital image. Further, as shown, the image analysis graph 1000 associates object masks 1010a and bounding boxes 1010b with object categories represented by node 1004 f. Indeed, in some implementations, the scene-based image editing system 106 generates content, such as object masks and bounding boxes, for objects depicted in the digital image. Thus, the image analysis map 1000 indicates that the pre-generated content is to be associated with nodes representing corresponding objects within the semantic scene graph generated for the digital image.
As further shown in FIG. 10, the image analysis graph 1000 associates property attributes with one or more of the edges 1006a-1006h to represent property attributes of corresponding property relationships represented by the edges 1006a-1006 h. For example, as shown, the image analysis graph 1000 associates a property attribute 1012a with an edge 1006g that indicates that an object depicted in the digital image is to be a member of a particular object group. Further, the image analysis graph 1000 associates a characteristic attribute 1012b with an edge 1006h indicating that at least some objects depicted in the digital image have a relationship with each other. FIG. 10 illustrates a sample of relationships identified between objects in various embodiments, and additional details regarding these relationships will be discussed in more detail below.
It should be noted that the property categories and property attributes represented in fig. 10 are exemplary, and that image analysis diagram 1000 includes various property categories and/or property attributes not shown in various embodiments. Furthermore, fig. 10 shows a specific organization of the image analysis map 1000, although alternative arrangements are used in different embodiments. Indeed, in various embodiments, scene-based image editing system 106 adapts to various property categories and property attributes to facilitate subsequent generation of semantic scene graphs that support various image edits. In other words, the scene-based image editing system 106 includes those property categories and property attributes that it determines to be useful in editing digital images.
In some embodiments, the scene-based image editing system 106 utilizes the real-world category description map to generate a semantic scene map for the digital image. FIG. 11 illustrates a real world category description map 1102 used by the scene-based image editing system 106 in generating a semantic scene map in accordance with one or more embodiments.
In one or more embodiments, the real world category description map includes a template map describing scene components (e.g., semantic regions) that may be depicted in a digital image. In particular, in some embodiments, the real world category description map includes a template map that is used by the scene-based image editing system 106 to provide context information to the semantic scene map regarding scene components (such as objects) that may be depicted in the digital image. For example, in some implementations, the real world category description map provides a hierarchy of object classification and/or parsing (e.g., object components) of certain objects that may be depicted in the digital image. In some cases, the real world category description map also includes object attributes associated with the objects represented therein. For example, in some cases, the real world category description map provides object properties assigned to a given object, such as shape, color, material from which the object is made, weight of the object, weight that the object can support, and/or various other properties determined to be useful in subsequently modifying the digital image. Indeed, as will be discussed, in some cases, the scene-based image editing system 106 utilizes the semantic scene graph of the digital image to suggest certain edits or to suggest avoiding certain edits to maintain consistency of the digital image relative to the context information contained in the real-world category description graph that built the semantic scene graph.
As shown in FIG. 11, the real world category description graph 1102 includes a plurality of nodes 1104a-1104h and a plurality of edges 1106a-1106e connecting some of the nodes 1104a-1104 h. In particular, in contrast to the image analysis diagram 1000 of fig. 10, the real world category description diagram 1102 does not provide a single network of interconnected nodes. Conversely, in some implementations, the real-world category description map 1102 includes a plurality of clusters of nodes 1108a-1108c that are separate and distinct from each other.
In one or more embodiments, each node cluster corresponds to a separate scene component (e.g., semantic region) category that may be depicted in the digital image. In fact, as shown in FIG. 11, each of the node clusters 1108a-1108c corresponds to a separate object category that may be depicted in a digital image. As described above, in various embodiments, the real world category description map 1102 is not limited to representing object categories, and may represent other scene component categories.
As shown in FIG. 11, each of the node clusters 1108a-1108c depicts a hierarchy of class descriptions (alternatively referred to as a hierarchy of object classifications) corresponding to the represented object classes. In other words, each of the node clusters 1108a-1108c depicts a degree of specificity/versatility that describes or marks objects. Indeed, in some embodiments, the scene-based image editing system 106 applies all of the category descriptions/tags represented in the node cluster to describe the corresponding objects depicted in the digital image. However, in some implementations, the scene-based image editing system 106 describes objects with a subset of category descriptions/tags.
As an example, node cluster 1108a includes node 1104a representing a side table category and node 1104b representing a table category. Further, as shown in fig. 11, node cluster 1108a includes an edge 1106a between node 1104a and node 1104b to indicate that the side table category is a sub-category of the table category, thereby indicating a hierarchy between the two categories applicable to the side table. In other words, node cluster 1108a indicates that a side table may be classified as a side table and/or more generally as a table. Thus, in one or more embodiments, upon detecting a side table depicted in the digital image, the scene-based image editing system 106 labels the side table as a side table and/or as a table based on the hierarchical structure represented in the real-world category description map 1102.
The degree to which a cluster of nodes represents a hierarchy of class descriptions varies in various embodiments. In other words, the length/height of the hierarchical structure represented is different in various embodiments. For example, in some implementations, node cluster 1108a also includes nodes representing furniture categories, indicating that side tables may be categorized as furniture. In some cases, node cluster 1108a also includes nodes representing inanimate objects, indicating that the side tables may be so classified. Further, in some implementations, node cluster 1108a includes nodes that represent categories of entities, indicating that a side table may be classified as an entity. Indeed, in some implementations, the hierarchy of category descriptions represented within the real-world category description graph 1102 includes category descriptions/labels-e.g., entity categories-in such a high level of versatility that it generally applies to all objects represented within the real-world category description graph 1102.
As further shown in fig. 11, node cluster 1108a includes a parsing of the represented object category (e.g., object components). In particular, node cluster 1108a includes a representation of an element of the table category of objects. For example, as shown, node cluster 1108a includes node 1104c representing a leg category. In addition, node cluster 1108a includes an edge 1106b that indicates that a leg from the leg category is part of a table from the table category. In other words, the edge 1106b indicates that the leg is a component of the table. In some cases, node cluster 1108a includes additional nodes for representing other components that are part of a table, such as a tabletop, a leaf, or an apron.
As shown in fig. 11, the node 1104c representing the table leg object category is connected to the node 1104b representing the table object category, instead of the node 1104a representing the side table category of the object. Indeed, in some implementations, the scene-based image editing system 106 utilizes such a configuration based on determining that all tables include one or more legs. Thus, since the side tables are sub-categories of tables, the configuration of node cluster 1108a indicates that all side tables also include one or more legs. However, in some implementations, the scene-based image editing system 106 additionally or alternatively connects the node 1104c representing the category of leg objects to the node 1104a representing the side-table object class to specify that all side tables include one or more legs.
Similarly, node cluster 1108a includes object attributes 1110a-1110d associated with node 1104a for the side table category, and additional object attributes 1112a-1112g associated with node 1104b for the table category. Thus, node cluster 1108a indicates that object attributes 1110a-1110d are specific to a side table category, while additional object attributes 1112a-1112g are more generally associated with a table category (e.g., associated with all object categories that fall within a table category). In one or more embodiments, the object attributes 1110a-1110d and/or the additional object attributes 1112a-1112g are attributes that have been arbitrarily assigned to their respective object categories (e.g., via user input or system defaults). For example, in some cases, the scene-based image editing system 106 determines that all side tables can support 100 pounds, as shown in fig. 11, regardless of the materials used or the quality of the build. However, in some cases, object attributes 1110a-1110d and/or additional object attributes 1112a-1112g represent object attributes that are common to all objects that fall within a particular category, such as a relatively small side table size. However, in some implementations, the object attributes 1110a-1110d and/or the additional object attributes 1112a-1112g are indicators of object attributes that should be determined for the objects of the respective object categories. For example, in one or more embodiments, upon identifying the side table, the scene-based image editing system 106 determines at least one of a capacity, a size, a weight, or a supporting weight of the side table.
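The hierarchy of category descriptions, object components, and object attributes within such a node cluster could be encoded in a small lookup structure, as in the hypothetical sketch below; the class names and attribute values are illustrative only.

```python
REAL_WORLD_CLASSES = {
    "table": {
        "parent": "furniture",                  # tables are a sub-category of furniture
        "components": ["leg"],                  # every table includes one or more legs
        "attributes": {"movable": True},
    },
    "side table": {
        "parent": "table",                      # side tables are a sub-category of tables
        "attributes": {"supports_pounds": 100, "size": "small"},
    },
}

def class_lineage(label):
    """Walk up the hierarchy of category descriptions for a detected object."""
    lineage = []
    while label in REAL_WORLD_CLASSES:
        lineage.append(label)
        label = REAL_WORLD_CLASSES[label].get("parent")
    return lineage  # e.g., class_lineage("side table") -> ["side table", "table"]
```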
It should be noted that in some embodiments, there is some overlap between the object properties included in the real world category description graph and the property properties included in the image analysis graph. In fact, in many implementations, the object properties are object-specific property properties (rather than properties for settings or scenes of the digital image). Further, it should be noted that the object attributes are merely exemplary and do not necessarily reflect object attributes to be associated with the object class. Indeed, in some embodiments, the object properties shown and their association with a particular object class are configurable to accommodate different needs for editing digital images.
In some cases, a cluster of nodes corresponds to a particular class of objects and presents a class description and/or hierarchy of object components for that particular class. For example, in some implementations, node cluster 1108a corresponds only to a side table category and presents a hierarchy of category descriptions and/or object components related to the side table. Thus, in some cases, when identifying a side table within a digital image, the scene-based image editing system 106 references the node cluster 1108a of the side table category when generating a semantic scene graph, but references another node cluster when identifying a sub-class of another table within the digital image. In some cases, the other node cluster includes several similarities (e.g., similar nodes and edges) to node cluster 1108a, as the other type of table will be included in a sub-category of the table category and include one or more legs.
However, in some implementations, the node clusters correspond to a plurality of different but related object categories and present a common hierarchy of category descriptions and/or object components for those object categories. For example, in some embodiments, node cluster 1108a includes an additional node representing a table category that is connected to node 1104b representing a table category via an edge indicating that the table is also a sub-category of tables. Indeed, in some cases, node cluster 1108a includes nodes representing various sub-categories of the table category. Thus, in some cases, when identifying a table from a digital image, the scene-based image editing system 106 references the node cluster 1108a in generating a semantic scene graph for the digital image, regardless of the subcategory to which the table belongs.
As will be described, in some implementations, object interactions within a digital image are facilitated using a common node cluster for multiple related subcategories. For example, as described above, FIG. 11 illustrates a plurality of individual node clusters. However, as further mentioned, in some cases, the scene-based image editing system 106 includes a classification (e.g., entity classification) that is common between all of the represented objects within the real-world category description map 1102. Thus, in some implementations, the real-world class description map 1102 includes a single network of interconnected nodes, where all clusters of nodes corresponding to individual object classes are connected at a common node, such as a node representing an entity class. Thus, in some embodiments, the real world category description graph 1102 shows the relationships between all represented objects.
In one or more embodiments, the scene-based image editing system 106 utilizes behavior policies in generating a semantic scene graph of a digital image. FIG. 12 illustrates a behavior policy diagram 1202 used by the scene-based image editing system 106 in generating a semantic scene graph in accordance with one or more embodiments.
In one or more embodiments, the behavior policy map includes a template map that describes behavior of objects depicted in the digital image based on a context in which the objects are depicted. In particular, in some embodiments, the behavior policy map includes a template map that assigns behaviors to objects depicted in the digital image based on semantic understanding of the objects depicted in the digital image and/or their relationships to other objects depicted in the digital image. Indeed, in one or more embodiments, behavior policies include various relationships between various types of objects, and specify behaviors for those relationships. In some cases, scene-based image editing system 106 includes a behavior policy map as part of a semantic scene graph. In some implementations, as will be discussed further below, behavior policies are separated from the semantic scene graph, but plug-in behavior is provided based on semantic understanding and relationships of objects represented in the semantic scene graph.
As shown in FIG. 12, behavior policy graph 1202 includes a plurality of relationship indicators 1204a-1204e and a plurality of behavior indicators 1206a-1206e associated with relationship indicators 1204a-1204 e. In one or more embodiments, the relationship indicators 1204a-1204e reference a relationship subject (e.g., an object in a digital image that is the subject of the relationship) and a relationship object (e.g., an object in a digital image that is the object of the relationship). For example, the relationship indicators 1204a-1204e of FIG. 12 indicate that the relationship subject is "supported" by or is "a part of the relationship object. Further, in one or more embodiments, the behavior indicators 1206a-1206e assign behaviors to the relationship subjects (e.g., indicate that the relationship subjects "move together" or "delete together" with the relationship object). In other words, the behavior indicators 1206a-1206e provide modification instructions for the relationship subject when the relationship subject is modified.
FIG. 12 provides a small subset of the relationships identified by the scene-based image editing system 106 in various embodiments. For example, in some implementations, relationships identified by scene-based image editing system 106 and incorporated into the generated semantic scene graph include, but are not limited to, relationships described as "above," "below," "back," "front," "contact," "held," "holding," "supporting," "standing on," "worn," "wearing," "resting on," "being viewed" or "looking" at. Indeed, as described above, in some implementations, scene-based image editing system 106 utilizes pairs of relationships to describe relationships between objects in two directions. For example, in some cases, where it is described that a first object is "supported" by a second object, the scene-based image editing system 106 also describes that the second object is "supporting" the first object. Thus, in some cases, behavior policy map 1202 includes these relationship pairs, and scene-based image editing system 106 includes information in the semantic scene graph accordingly.
As further shown, the behavior policy diagram 1202 also includes a plurality of classification indicators 1208a-1208e associated with the relationship indicators 1204a-1204 e. In one or more embodiments, the classification indicators 1208a-1208e indicate the object class to which the assigned behavior applies. In fact, in one or more embodiments, the classification indicators 1208a-1208e reference the object class of the correspondence object. As shown in FIG. 12, the classification indicators 1208a-1208e indicate that an action is assigned to an object class that is a sub-class of the specified object class. In other words, FIG. 12 shows that the classification indicators 1208a-1208e reference a particular object class and indicate that the assigned behavior applies to all objects that fall within the object class (e.g., an object class that is part of a sub-class that falls within the object class).
The level of commonality or specificity of the specified object categories referenced by the classification indicators within their corresponding object classification hierarchies varies in various embodiments. For example, in some embodiments, the classification indicator references the lowest classification level (e.g., the most specific classification applicable) such that there are no subcategories and the corresponding behavior applies only to those objects having that particular object's lowest classification level. On the other hand, in some implementations, the classification indicator references the highest classification level (e.g., the most general classification applicable) or some other level above the lowest classification level such that the corresponding behavior applies to objects associated with one or more of the plurality of classification levels that exist within the specified classification level.
To provide a description of how behavior policy graph 1202 indicates assigned behavior, relationship indicator 1204a indicates a "supporting" relationship between an object (e.g., a relationship subject) and another object (e.g., a relationship object). The behavior indicator 1206a indicates a "move together" behavior associated with a "support" relationship, and the classification indicator 1208a indicates that the particular behavior applies to objects within a specified object class. Thus, in one or more embodiments, behavior policy diagram 1202 shows that an object that falls into a specified object class and has a "supporting" relationship with another object will exhibit "move together" behavior. In other words, if a first object specifying an object class is depicted in a digital image supported by a second object and the digital image is modified to move the second object, then the scene-based image editing system 106 automatically moves the first object with the second object as part of the modification according to the behavior policy map 1202. In some cases, rather than automatically moving the first object, the scene-based image editing system 106 provides suggestions for moving the first object for display within a graphical user interface for modifying the digital image.
As shown in fig. 12, some relationship indicators (e.g., relationship indicators 1204a-1204b or relationship indicators 1204c-1204 e) refer to the same relationship, but are associated with different behaviors. Indeed, in some implementations, behavior policy map 1202 assigns multiple behaviors to the same relationship. In some cases, the differences are due to differences in the specified subclasses. In particular, in some embodiments, the scene-based image editing system 106 assigns objects of one object class a particular behavior for a particular relationship, but assigns objects of another object class a different behavior for the same relationship. Thus, in the configuration behavior policy diagram 1202, the scene-based image editing system 106 manages different object categories differently in various embodiments.
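To make the foregoing concrete, the following minimal Python sketch shows one plausible way to represent and query entries of a behavior policy graph of this kind; the class names, entry values, and helper function are hypothetical and are not taken from the system's actual implementation.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class BehaviorPolicy:
        relationship: str        # relationship of the subject to the object (e.g., "supported by")
        behavior: str            # behavior assigned to the relationship subject (e.g., "moves with")
        applies_to_class: str    # object class (including its sub-classes) the behavior applies to

    # Hypothetical policy entries mirroring the relationship/behavior/classification indicators.
    BEHAVIOR_POLICY_GRAPH = [
        BehaviorPolicy("supported by", "moves with", "furniture"),
        BehaviorPolicy("part of", "deleted with", "entity"),
    ]

    def behaviors_for(relationship: str, object_class_lineage: list[str]) -> list[str]:
        """Return behaviors whose specified class appears in the subject's class lineage."""
        return [
            policy.behavior
            for policy in BEHAVIOR_POLICY_GRAPH
            if policy.relationship == relationship
            and policy.applies_to_class in object_class_lineage
        ]

    # Example: a lamp (a sub-class of furniture) that is "supported by" a table
    # inherits the "moves with" behavior.
    print(behaviors_for("supported by", ["lamp", "furniture", "entity"]))  # ['moves with']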
FIG. 13 illustrates a semantic scene graph 1302 generated by a scene-based image editing system 106 for a digital image in accordance with one or more embodiments. In particular, the semantic scene graph 1302 shown in FIG. 13 is a simplified example of a semantic scene graph and in various embodiments does not depict all of the information included in the semantic scene graph generated by the scene-based image editing system 106.
As shown in fig. 13, a semantic scene graph 1302 is organized according to the image analysis graph 1000 described above with reference to fig. 10. In particular, semantic scene graph 1302 includes a single network of interconnected nodes that reference characteristics of a digital image. For example, semantic scene graph 1302 includes nodes 1304a-1304c representing depicted objects as indicated by their connection to node 1306. In addition, semantic scene graph 1302 includes relationship indicators 1308a-1308c that represent relationships between objects corresponding to nodes 1304a-1304 c. As further shown, the semantic scene graph 1302 includes nodes 1310 that represent commonalities between objects (e.g., where the objects are all included in the digital image, or where the objects indicate a subject or topic of the digital image). Further, as illustrated, semantic scene graph 1302 includes property attributes 1314a-1314f for objects corresponding to nodes 1304a-1304 c.
As further shown in fig. 13, the semantic scene graph 1302 includes context information from the real world category description graph 1102 described above with reference to fig. 11. In particular, semantic scene graph 1302 includes nodes 1312a-1312c that indicate the object categories to which the objects corresponding to nodes 1304a-1304c belong. Although not shown in FIG. 13, in some embodiments semantic scene graph 1302 also includes the full hierarchy of object classifications for each object category represented by nodes 1312a-1312c. However, in some cases, each of the nodes 1312a-1312c instead includes a pointer to its respective object classification hierarchy within the real world category description map 1102. Further, as shown in FIG. 13, semantic scene graph 1302 includes object attributes 1316a-1316e for the object categories represented therein.
Further, as shown in fig. 13, semantic scene graph 1302 includes behaviors from behavior policy graph 1202 described above with reference to fig. 12. In particular, semantic scene graph 1302 includes behavior indicators 1318a-1318b, where behavior indicators 1318a-1318b indicate behavior of objects represented therein based on their associations.
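As a rough, hypothetical illustration of how these pieces fit together, the short Python sketch below assembles a toy semantic scene graph with object nodes, per-object attributes, a relationship, a pointer into a category hierarchy, and an attached behavior; all field names and values are invented for readability.

    # A toy semantic scene graph combining the structures of FIGS. 10-12: object nodes
    # with attributes, relationships between nodes, pointers into the real-world
    # category description graph, and behaviors from the behavior policy graph.
    semantic_scene_graph = {
        "objects": {
            "obj_1": {
                "label": "table",
                "category_ref": "real_world_graph/furniture/table",  # pointer, not a copy
                "attributes": {"color": "brown", "material": "wood"},
            },
            "obj_2": {
                "label": "vase",
                "category_ref": "real_world_graph/container/vase",
                "attributes": {"color": "white"},
            },
        },
        "relationships": [
            {"subject": "obj_2", "predicate": "supported by", "object": "obj_1"},
        ],
        "behaviors": [
            {"subject": "obj_2", "behavior": "moves with", "source_relationship": "supported by"},
        ],
    }

    # Example query: which objects should move together with obj_1?
    movers = [
        b["subject"]
        for b in semantic_scene_graph["behaviors"]
        if b["behavior"] == "moves with"
        and any(
            r["subject"] == b["subject"] and r["object"] == "obj_1"
            for r in semantic_scene_graph["relationships"]
        )
    ]
    print(movers)  # ['obj_2']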
FIG. 14 shows an illustration for generating a semantic scene graph for a digital image using a template graph in accordance with one or more embodiments. In effect, as shown in FIG. 14, the scene-based image editing system 106 utilizes one or more neural networks 1404 to analyze the digital image 1402. In particular, in one or more embodiments, the scene-based image editing system 106 utilizes one or more neural networks 1404 to determine various characteristics of the digital image 1402 and/or corresponding characteristic attributes thereof. For example, in some cases, scene-based image editing system 106 utilizes a segmented neural network to identify and classify objects depicted in a digital image (as discussed above with reference to fig. 3). Furthermore, in some embodiments, scene-based image editing system 106 utilizes a neural network to determine relationships between objects and/or their object properties, as will be discussed in more detail below.
In one or more implementations, the scene-based image editing system 106 utilizes a depth estimation neural network to estimate the depth of objects in the digital image and stores the determined depth in the semantic scene graph 1412. For example, the scene-based image editing system 106 utilizes a depth estimation neural network as described in U.S. application Ser. No. 17/186,436, entitled "GENERATING DEPTH IMAGES UTILIZING A MACHINE-LEARNING MODEL BUILT FROM MIXEDDIGITAL IMAGE SOURCES AND MULTIPLE LOSS FUNCTION SETS" (generating depth images using machine learning models built from hybrid digital image sources and multiple loss function sets), filed on month 26 of 2021, which is incorporated herein by reference in its entirety. Alternatively, the scene-based image editing system 106 utilizes a depth refined neural network as described in U.S. application 17/658,873 entitled "UTILIZING MACHINE LEARNING MODELS TO GENERATE REFINED DEPTH MAPS WITH SEGMENTATION MASK GUIDANCE" (generating a refined depth map with segmentation masking guidance using a machine learning model) filed on day 4, month 12 of 2022, which is incorporated herein by reference in its entirety. Then, when editing the object to perform real scene editing, the scene-based image editing system 106 accesses depth information (e.g., the average depth of the object) of the object from the semantic scene graph 1412. For example, as objects are moved within an image, scene-based image editing system 106 then accesses depth information for objects in the digital image from semantic scene graph 1412 to ensure that the moved objects are not placed in front of objects having a smaller depth.
In one or more implementations, the scene-based image editing system 106 utilizes a lighting estimation neural network to estimate illumination parameters of objects or scenes in the digital image and stores the determined illumination parameters in the semantic scene graph 1412. For example, the scene-based image editing system 106 utilizes a source-specific lighting estimation neural network as described in U.S. application Ser. No. 16/558,975, entitled "DYNAMICALLY ESTIMATING LIGHT-SOURCE-SPECIFIC PARAMETERS FOR DIGITAL IMAGES USING A NEURAL NETWORK" (dynamically estimating light-source-specific parameters of digital images using a neural network), filed on September 3, 2019, which is incorporated herein by reference in its entirety. Then, when editing an object to perform real scene editing, the scene-based image editing system 106 accesses the lighting parameters of the object or scene from the semantic scene graph 1412. For example, when moving objects within an image or inserting new objects in a digital image, scene-based image editing system 106 accesses the illumination parameters from the semantic scene graph 1412 to ensure that objects moved/placed within the digital image have realistic illumination.
As further shown in fig. 14, the scene-based image editing system 106 utilizes the output of the one or more neural networks 1404 along with the image analysis map 1406, the real-world category description map 1408, and the behavioral policy map 1410 to generate a semantic scene map 1412. In particular, the scene-based image editing system 106 generates a semantic scene graph 1412 to include a description of the digital image 1402 based on the structure, property attributes, hierarchical structure of object classifications, and behaviors provided by the image analysis graph 1406, the real world category description graph 1408, and the behavior policy graph 1410.
As previously described, in one or more embodiments, the image analysis map 1406, the real world category description map 1408, and/or the behavior policy map 1410 are predetermined or pre-generated. In other words, the scene-based image editing system 106 pre-generates, constructs, or otherwise determines the content and organization of each graph prior to implementation. For example, in some cases, scene-based image editing system 106 generates image analysis map 1406, real-world category description map 1408, and/or behavior policy map 1410 based on user input.
Further, in one or more embodiments, the image analysis map 1406, the real world category description map 1408, and/or the behavior policy map 1410 are configurable. In practice, the data represented therein may be reconfigured, reorganized, and/or added or removed based on preferences or the need to edit the digital image. For example, in some cases, the behavior assigned by behavior policy map 1410 works in some image editing contexts, but not in other image editing contexts. Thus, when editing an image in another image editing context, the scene-based image editing system 106 implements one or more neural networks 1404 and image analysis graphs 1406, but implements different behavior policy graphs (e.g., behavior policy graphs configured to meet the preferences of the image editing context). Thus, in some embodiments, the scene-based image editing system 106 modifies the image analysis map 1406, the real-world category description map 1408, and/or the behavior policy map 1410 to accommodate different image editing contexts.
For example, in one or more implementations, the scene-based image editing system 106 determines a context for selecting a behavior policy map by identifying a user type. In particular, the scene-based image editing system 106 generates a plurality of behavior policy maps for various types of users. For example, the scene-based image editing system 106 generates a first behavior policy map for a novice or new user. In one or more implementations, the first behavior policy map includes a greater number of behavior policies than a second behavior policy map. In particular, for newer users, the scene-based image editing system 106 utilizes the first behavior policy map, which provides greater automation of behaviors and less control to the user. On the other hand, for advanced users, the scene-based image editing system 106 uses the second behavior policy map, which has fewer behavior policies than the first behavior policy map. In this way, the scene-based image editing system 106 provides advanced users with greater control over relationship-based actions (e.g., automatically moving, deleting, or editing related objects). In other words, by utilizing the second behavior policy map for advanced users, the scene-based image editing system 106 performs less automatic editing of related objects.
In one or more implementations, the scene-based image editing system 106 determines a context for selecting a behavior policy map based on visual content of the digital image (e.g., a type of object depicted in the digital image), an editing application being used, and the like. Thus, in one or more implementations, the scene-based image editing system 106 selects/uses the behavior policy map based on the image content, the user type, the editing application being used, or another context.
Furthermore, in some embodiments, scene-based image editing system 106 utilizes a graph in analyzing a plurality of digital images. Indeed, in some cases, the image analysis map 1406, the real world category description map 1408, and/or the behavior policy map 1410 are not specifically targeted to a particular digital image. Thus, in many cases, these figures are generic and reused by the scene-based image editing system 106 for multiple instances of digital image analysis.
In some cases, the scene-based image editing system 106 also implements one or more mappings to map between the output of the one or more neural networks 1404 and the data schemas of the image analysis map 1406, the real-world category description map 1408, and/or the behavior policy map 1410. As one example, in various embodiments, scene-based image editing system 106 utilizes various segmented neural networks to identify and classify objects. Thus, depending on the segmented neural network used, the resulting classification of a given object may be different (e.g., different terms (words) or different levels of abstraction). Thus, in some cases, the scene-based image editing system 106 utilizes a mapping that maps particular outputs of the segmented neural networks to object categories represented in the real-world category description map 1408, allowing the real-world category description map 1408 to be used in conjunction with multiple neural networks.
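One plausible way to implement such a mapping is a simple lookup table that normalizes each segmentation network's output vocabulary to the category names used by the real-world category description graph. The sketch below is illustrative only; the network names and labels are assumptions.

    # Hypothetical per-network label mappings onto the category names used by the
    # real-world category description graph. Each segmentation network may emit
    # labels at a different granularity or with different wording.
    NETWORK_LABEL_MAPS = {
        "segmenter_a": {"dining table": "table", "couch": "sofa"},
        "segmenter_b": {"desk": "table", "settee": "sofa"},
    }

    def to_category_graph_label(network_name: str, predicted_label: str) -> str:
        """Map a network-specific label to the shared category vocabulary."""
        mapping = NETWORK_LABEL_MAPS.get(network_name, {})
        return mapping.get(predicted_label, predicted_label)  # fall back to the raw label

    print(to_category_graph_label("segmenter_b", "desk"))  # 'table'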
FIG. 15 shows another illustration of a semantic scene graph for generating a digital image in accordance with one or more embodiments. In particular, FIG. 15 illustrates an example framework for generating a semantic scene graph by scene-based image editing system 106 in accordance with one or more embodiments.
As shown in fig. 15, scene-based image editing system 106 identifies input image 1500. In some cases, scene-based image editing system 106 identifies input image 1500 based on a request. For example, in some cases, the request includes a request to generate a semantic scene graph for the input image 1500. In one or more implementations, the request to analyze the input image includes the scene-based image editing system 106 accessing, opening, or displaying the input image 1500.
In one or more embodiments, in response to the request, scene-based image editing system 106 generates object suggestions and sub-graph suggestions for input image 1500. For example, in some embodiments, scene-based image editing system 106 utilizes object suggestion network (object proposal network) 1520 to extract a set of object suggestions for input image 1500. To illustrate, in some cases, the scene-based image editing system 106 extracts a set of object suggestions for humans detected within the input image 1500, object(s) worn by humans, object(s) near humans, building, plant, animal, background object or scene (including sky or objects in the sky), and so forth.
In one or more embodiments, the object suggestion network 1520 includes the detection masking neural network 300 (specifically, the object detection machine learning model 308) discussed above with reference to fig. 3. In some cases, the object suggestion network 1520 includes a neural network, such as a regional suggestion network ("RPN"), that is part of a region-based convolutional neural network to extract a set of object suggestions represented by a plurality of bounding boxes. One example RPN is disclosed in "fast r-cnn: towards real-time object detection with region proposal networks" (Faster r-cnn: real-time target detection using regional advice networks), S.ren, K.He, R.Girshick, and J.Sun, NIPS,2015, the entire contents of which are incorporated herein by reference. As an example, in some cases, the scene-based image editing system 106 uses RPNs to extract object suggestions of important objects (e.g., detectable objects or objects having a threshold size/visibility) within the input image. The following algorithm represents one embodiment of an object suggestion set:
$[o_0, \ldots, o_{N-1}] = f_{RPN}(I)$

where $I$ is the input image, $f_{RPN}(\cdot)$ represents the RPN network, and $o_i$ is the $i$-th object suggestion.
In some implementations, in conjunction with determining object suggestions, the scene-based image editing system 106 also determines coordinates of each object suggestion relative to the size of the input image 1500. Specifically, in some cases, the location of the object suggestion is based on a bounding box containing the visible portion(s) of the object within the digital image. To illustrate, for $o_i$, the coordinates of the corresponding bounding box are represented by $r_i = [x_i, y_i, w_i, h_i]$, where $(x_i, y_i)$ is the upper-left corner coordinate, and $w_i$ and $h_i$ are the width and height of the bounding box, respectively. Thus, the scene-based image editing system 106 determines the relative position of each important object or entity in the input image 1500 and stores the position data along with the set of object suggestions.
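The bounding-box bookkeeping described above can be pictured with a short sketch. The proposal values below are fabricated, and no particular detection library is assumed.

    from dataclasses import dataclass

    @dataclass
    class ObjectProposal:
        label_hint: str                               # optional coarse label from the proposal stage
        box: tuple[float, float, float, float]        # (x, y, w, h): top-left corner plus size
        confidence: float                             # confidence that the box matches a real object

    # Fabricated proposals o_0 ... o_2 for an input image, stored with their positions
    # relative to the image so later stages can reason about layout.
    proposals = [
        ObjectProposal("person", (120.0, 40.0, 80.0, 200.0), 0.97),
        ObjectProposal("phone", (150.0, 120.0, 20.0, 35.0), 0.88),
        ObjectProposal("table", (60.0, 180.0, 240.0, 90.0), 0.91),
    ]

    def box_center(box):
        x, y, w, h = box
        return (x + w / 2.0, y + h / 2.0)

    for i, p in enumerate(proposals):
        print(f"o_{i}: {p.label_hint}, center={box_center(p.box)}, conf={p.confidence}")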
As described above, in some implementations, scene-based image editing system 106 also determines sub-graph suggestions for the object suggestions. In one or more embodiments, the sub-graph suggestion indicates a relationship that relates to a particular object suggestion in the input image 1500. It will be appreciated that any two different objects (oi, oj) in a digital image may correspond to two possible relationships in opposite directions. As an example, a first object may be "above" a second object, and the second object may be "below" the first object. Since each pair of objects has two possible relationships, the total number of possible relationships suggested by the N objects is N (N-1). Thus, in a system that attempts to determine all possible relationships in two directions for each object suggestion of an input image, more object suggestions results in a larger scene graph than fewer object suggestions, while increasing computational costs and reducing the inference speed of object detection.
The sub-graph suggestions reduce the number of potential relationships analyzed by the scene-based image editing system 106. In particular, as previously described, the sub-graph suggestions represent relationships involving two or more particular object suggestions. Thus, in some cases, scene-based image editing system 106 determines sub-graph suggestions for input image 1500 to reduce the number of potential relationships by clustering, rather than maintaining the N(N-1) possible relationships. In one or more embodiments, the scene-based image editing system 106 uses the clustering and sub-graph suggestion generation process described by Y. Li, W. Ouyang, B. Zhou, Y. Cui, J. Shi, and X. Wang in "Factorizable net: An efficient subgraph-based framework for scene graph generation," ECCV, June 29, 2018, which is incorporated herein by reference in its entirety.
As an example, for an object suggestion pair, scene-based image editing system 106 determines a subgraph based on a confidence score associated with the object suggestion. To illustrate, the scene-based image editing system 106 generates each object suggestion with a confidence score that indicates a confidence that the object suggestion is a correct match with a corresponding region of the input image. The scene-based image editing system 106 also determines sub-graph suggestions for the object suggestion pairs based on a combined confidence score that is the product of the confidence scores of the two object suggestions. The scene-based image editing system 106 also constructs the sub-graph suggestions as joint boxes of object suggestions with combined confidence scores.
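A minimal sketch of the combined-confidence-score and joint-box idea might look as follows; the proposal values are fabricated and the helper names are hypothetical.

    def union_box(box_a, box_b):
        """Smallest box (x, y, w, h) containing both input boxes."""
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        x0, y0 = min(ax, bx), min(ay, by)
        x1, y1 = max(ax + aw, bx + bw), max(ay + ah, by + bh)
        return (x0, y0, x1 - x0, y1 - y0)

    def subgraph_suggestion(prop_a, prop_b):
        """Join two object proposals (box, confidence) into a sub-graph proposal."""
        box_a, conf_a = prop_a
        box_b, conf_b = prop_b
        return {
            "box": union_box(box_a, box_b),   # joint box covering both objects
            "confidence": conf_a * conf_b,    # product of the two confidence scores
        }

    # Fabricated proposals: (bounding box, confidence score).
    person = ((120.0, 40.0, 80.0, 200.0), 0.97)
    phone = ((150.0, 120.0, 20.0, 35.0), 0.88)
    print(subgraph_suggestion(person, phone))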
In some cases, scene-based image editing system 106 also suppresses the sub-graph suggestions to represent each candidate relationship as two objects and one sub-graph. In particular, in some embodiments, scene-based image editing system 106 utilizes non-maximum suppression to represent the candidate relationships as $\langle o_i, o_j, s_k^i \rangle$, where $i \neq j$ and $s_k^i$ is the $k$-th sub-graph of all sub-graphs associated with $o_i$; the sub-graph of $o_i$ includes $o_j$ and potentially other object suggestions. After suppressing the sub-graph suggestions, scene-based image editing system 106 represents each object and each sub-graph as a feature vector $o_i \in \mathbb{R}^D$ and a feature map $s_k \in \mathbb{R}^{D \times K_a \times K_a}$, respectively, where $D$ and $K_a$ are dimensions.
After determining the object suggestions and sub-graph suggestions for objects in the input image, the scene-based image editing system 106 retrieves and embeds the relationships from the external knowledge base 1522. In one or more embodiments, the external knowledge base includes a dataset that relates to semantic relationships of objects. In particular, in some embodiments, the external knowledge base includes a semantic network that includes descriptions of relationships (also referred to herein as "common sense relationships") between objects based on background knowledge and contextual knowledge. In some implementations, the external knowledge base includes a database on one or more servers that includes knowledge of relationships from one or more sources, including expert-created resources, crowd-sourced resources, web-based sources, dictionaries, or other sources that include information about object relationships.
Furthermore, in one or more embodiments, embedding includes a representation of the relationship with the object as a vector. For example, in some cases, the relationship embedding includes using a vector representation of triples of relationships (i.e., object tags, one or more relationships, and object entities) extracted from an external knowledge base.
Indeed, in one or more embodiments, the scene-based image editing system 106 communicates with the external knowledge base 1522 to obtain useful object relationship information for improving object and sub-graph suggestions. Furthermore, in one or more embodiments, scene-based image editing system 106 refines the object suggestions and sub-graph suggestions (represented by block 1524) using the embedded relationships, as described in more detail below.
In some embodiments, in preparation for retrieving relationships from the external knowledge base 1522, the scene-based image editing system 106 performs an internal refinement process on the object and sub-graph suggestions (e.g., in preparation for refining the features of the object and sub-graph suggestions). In particular, scene-based image editing system 106 uses the knowledge that each object $o_i$ is connected to a set of sub-graphs $\mathbb{S}_i$, and that each sub-graph $s_k$ is associated with a set of objects $\mathbb{O}_k$, to refine the object vector (or the sub-graph feature map) by attending over the associated sub-graph feature maps (or the associated object vectors). For example, in some cases, the internal refinement process is expressed as:

$\bar{o}_i = o_i + f_{s \rightarrow o}\Big(\textstyle\sum_{s_k \in \mathbb{S}_i} \alpha_k^{s \rightarrow o} \cdot s_k\Big) \qquad \text{and} \qquad \bar{s}_k = s_k + f_{o \rightarrow s}\Big(\textstyle\sum_{o_i \in \mathbb{O}_k} \alpha_i^{o \rightarrow s} \cdot o_i\Big)$

where $\alpha_k^{s \rightarrow o}$ (respectively $\alpha_i^{o \rightarrow s}$) is a weight, output by a softmax layer, indicating how much of $s_k$ is transferred to $o_i$ (respectively how much of $o_i$ is transferred to $s_k$), and $f_{s \rightarrow o}$ and $f_{o \rightarrow s}$ are non-linear mapping functions. In one or more embodiments, because $o_i$ and $s_k$ have different dimensions, the scene-based image editing system 106 applies pooling or spatial-location-based attention for the $s \rightarrow o$ or $o \rightarrow s$ refinement.
In some embodiments, once the internal refinement is complete, scene-based image editing system 106 predicts object labels from the initially refined object feature vectors and matches the object labels with the corresponding semantic entities in the external knowledge base 1522. In particular, the scene-based image editing system 106 accesses the external knowledge base 1522 to obtain the most common relationships corresponding to the object labels. The scene-based image editing system 106 also selects a predetermined number of the most common relationships from the external knowledge base 1522 and refines the features of the corresponding object suggestions/feature vectors using the retrieved relationships.
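The retrieval step can be pictured as a top-K lookup of common-sense relationships keyed by the predicted object label. The entries below are fabricated stand-ins for an external knowledge base, not actual data.

    # Fabricated common-sense relationship frequencies, standing in for an external
    # knowledge base keyed by object label.
    KNOWLEDGE_BASE = {
        "chair": [("next to", 120), ("under", 85), ("supporting", 60), ("behind", 12)],
        "person": [("holding", 200), ("wearing", 180), ("sitting on", 90)],
    }

    def most_common_relationships(object_label: str, k: int = 2) -> list[str]:
        """Return the k most frequent relationships recorded for an object label."""
        entries = sorted(KNOWLEDGE_BASE.get(object_label, []), key=lambda e: -e[1])
        return [relation for relation, _count in entries[:k]]

    # The retrieved relationships would then be embedded and used to refine the
    # corresponding object suggestion's feature vector.
    print(most_common_relationships("person"))  # ['holding', 'wearing']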
In one or more embodiments, after refining the object suggestions and sub-graph suggestions using the embedded relationship, the scene-based image editing system 106 predicts object tags 1502 and predicate tags from the refinement suggestions. In particular, scene-based image editing system 106 predicts labels based on refined object/sub-graph features. For example, in some cases, scene-based image editing system 106 directly predicts each object tag using refined features of the corresponding feature vector. In addition, scene-based image editing system 106 predicts predicate labels (e.g., relationship labels) based on the subject and object feature vectors, in combination with its corresponding subgraph feature map resulting from the subgraph feature being associated with several object suggestion pairs. In one or more embodiments, the inference process for predicting tags is as follows:
$P_{i,j} = f_{rel}\big(\tilde{o}_i \otimes \tilde{s}_k \otimes \tilde{o}_j\big) \qquad \text{and} \qquad V_i = f_{node}\big(\tilde{o}_i\big)$

where $f_{rel}(\cdot)$ and $f_{node}(\cdot)$ represent the mapping layers for predicate and object recognition, respectively, and $\otimes$ represents a convolution operation. Furthermore, $\tilde{o}_i$, $\tilde{o}_j$, and $\tilde{s}_k$ represent the refined feature vectors (and sub-graph feature map) based on the relationships extracted from the external knowledge base.
In one or more embodiments, the scene-based image editing system 106 also uses the predicted labels to generate a semantic scene graph 1504. In particular, the scene-based image editing system 106 uses the object tags 1502 and the predicate labels from the refined features to create a graphical representation of the semantic information of the input image 1500. In one or more embodiments, scene-based image editing system 106 generates the scene graph as $\mathcal{G} = \big\{ \langle V_i, P_{i,j}, V_j \rangle \mid i \neq j \big\}$, where $\mathcal{G}$ is the scene graph.
Thus, the scene-based image editing system 106 utilizes the relative positions of objects and their tags in conjunction with the external knowledge base 1522 to determine relationships between objects. The scene-based image editing system 106 utilizes the determined relationships in generating the behavior policy map 1410. As an example, the scene-based image editing system 106 determines that the hand and the cell phone have overlapping positions within the digital image. Based on the relative position and depth information, the scene-based image editing system 106 determines that the person (associated with the hand) has a relationship of "holding" the cell phone. As another example, scene-based image editing system 106 determines that a person and a shirt have overlapping positions and overlapping depths within a digital image. Based on the relative position and relative depth information, the scene-based image editing system 106 determines that the person has a relationship of "wearing" a shirt. On the other hand, the scene-based image editing system 106 determines that the person and the shirt have overlapping positions and that the shirt has an average depth that is greater than an average depth of the person within the digital image. Based on the relative position and relative depth information, the scene-based image editing system 106 determines that the person has a relationship that is "in front of" the shirt.
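A rough sketch of this kind of position-and-depth reasoning is shown below; the thresholds, field names, and fallback rules are invented for illustration and are much simpler than a full implementation.

    def boxes_overlap(box_a, box_b):
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

    def infer_relationship(subject, obj, depth_tolerance=0.1):
        """Guess a relationship between two objects from overlap and average depth.

        Each argument is a dict with a bounding 'box' (x, y, w, h), a 'label', and an
        'avg_depth' value; the threshold and rules are illustrative only.
        """
        if not boxes_overlap(subject["box"], obj["box"]):
            return None
        depth_gap = obj["avg_depth"] - subject["avg_depth"]
        if abs(depth_gap) <= depth_tolerance:               # overlapping depths
            return "wearing" if obj["label"] == "shirt" else "holding"
        return "in front of" if depth_gap > 0 else "behind"  # object deeper / shallower

    person = {"label": "person", "box": (100, 50, 80, 220), "avg_depth": 2.0}
    shirt = {"label": "shirt", "box": (110, 90, 60, 80), "avg_depth": 2.05}
    print(infer_relationship(person, shirt))  # 'wearing'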
By generating a semantic scene graph for a digital image, the scene-based image editing system 106 provides improved flexibility and efficiency. Indeed, as described above, because the features used in modifying the digital image when user interactions are received to perform the modification are readily available, the scene-based image editing system 106 generates semantic scene graphs to provide improved flexibility. Thus, the scene-based image editing system 106 reduces the user interaction typically required under conventional systems to determine those features (or generate desired content, such as bounding boxes or object masks) in preparation for modification. Thus, the scene-based image editing system 106 provides a more efficient graphical user interface that requires less user interaction to modify the digital image.
In addition, by generating a semantic scene graph for the digital image, the scene-based image editing system 106 provides the ability to edit two-dimensional images like real world scenes. For example, based on the generated semantic scene graph of images generated using various neural networks, the scene-based image editing system 106 determines objects, their properties (location, depth, material, color, weight, size, labels, etc.). The scene-based image editing system 106 utilizes the information of the semantic scene graph to intelligently edit the image as if the image were a real-world scene.
Indeed, in one or more embodiments, the scene-based image editing system 106 facilitates modification of digital images using semantic scene graphs generated for the digital images. For example, in one or more embodiments, scene-based image editing system 106 facilitates modifying one or more object properties of objects depicted in a digital image using a corresponding semantic scene graph. FIGS. 16-21C illustrate modifying one or more object properties of an object depicted in a digital image in accordance with one or more embodiments.
Many conventional systems are inflexible in that they often require difficult and tedious workflows to make targeted modifications to specific object properties of objects depicted in digital images. Indeed, under such systems, modifying object properties typically requires manual manipulation of the object properties. For example, modifying the shape of an object depicted in a digital image typically requires multiple user interactions to manually reconstruct the boundary of the object (typically at the pixel level), while modifying the size typically requires cumbersome interactions with a resizing tool to resize and ensure scale. Thus, in addition to being inflexible, many conventional systems suffer from inefficiency, as the process required by these systems to perform such targeted modifications typically involves a significant amount of user interaction.
The scene-based image editing system 106 provides advantages over conventional systems by operating with improved flexibility and efficiency. In effect, by presenting graphical user interface elements through which user interactions can be targeted at object properties of objects, the scene-based image editing system 106 provides greater flexibility in terms of interactivity of objects depicted in the digital image. In particular, via graphical user interface elements, scene-based image editing system 106 provides flexible selection and modification of object properties. Thus, the scene-based image editing system 106 also provides improved efficiency by reducing the user interaction required to modify object properties. Indeed, as will be discussed below, the scene-based image editing system 106 enables user interactions to interact with a description of object properties in order to modify the object properties, thereby avoiding the difficult, cumbersome workflow of user interactions required by many conventional systems.
As suggested, in one or more embodiments, scene-based image editing system 106 facilitates modifying object properties of objects depicted in a digital image by determining object properties of those objects. Specifically, in some cases, scene-based image editing system 106 utilizes a machine learning model, such as an attribute classification neural network, to determine object attributes. Fig. 16-17 illustrate an attribute classification neural network used by scene-based image editing system 106 to determine object attributes of an object in accordance with one or more embodiments. In particular, fig. 16-17 illustrate multi-attribute contrast classification neural networks used by the scene-based image editing system 106 in one or more embodiments.
In one or more embodiments, the attribute classification neural network includes a computer-implemented neural network that identifies object attributes of objects depicted in the digital image. In particular, in some embodiments, the attribute classification neural network includes a computer-implemented neural network that analyzes objects depicted in the digital image, identifies object attributes of the objects, and in response provides labels corresponding to the object attributes. It should be appreciated that in many cases, attribute classification neural networks more broadly identify and classify attributes of semantic regions depicted in digital images. Indeed, in some implementations, the attribute classification neural network determines attributes of semantic regions depicted in the digital image other than the object (e.g., foreground or background).
FIG. 16 illustrates an overview of a multi-attribute contrast classification neural network in accordance with one or more embodiments. In particular, fig. 16 illustrates a scene-based image editing system 106 that utilizes a multi-attribute contrast classification neural network to extract various attribute tags (e.g., negative, positive, and unknown tags) of objects depicted within a digital image.
As shown in fig. 16, the scene-based image editing system 106 utilizes an embedded neural network 1604 with a digital image 1602 to generate an image-object feature map 1606 and a low-level attribute feature map 1610. In particular, the scene-based image editing system 106 generates an image-object feature map 1606 (e.g., image-object feature map X) by combining the object-tag embedding vector 1608 with the high-level attribute feature map from the embedded neural network 1604. For example, object-tag embedding vector 1608 represents the embedding of an object tag (e.g., "chair").
Further, as shown in fig. 16, the scene-based image editing system 106 generates a local image-object feature vector Zrel. In particular, the scene-based image editing system 106 utilizes the image-object feature map 1606 with the locator neural network 1612 to generate the local image-object feature vector Zrel. In particular, the scene-based image editing system 106 combines the image-object feature map 1606 with local object attention feature vectors 1616 (denoted G) to generate local image-object feature vectors Zrel to reflect the segmented predictions of the relevant objects (e.g., "chairs") depicted in the digital image 1602. As further shown in FIG. 16, in some embodiments, the locator neural network 1612 is trained using a reference truth object segmentation mask 1618.
In addition, as shown in FIG. 16, scene-based image editing system 106 also generates local low-level attribute feature vector Zlow. In particular, referring to FIG. 16, the scene-based image editing system 106 utilizes the local object attention feature vector G from the locator neural network 1612 with the low-level attribute feature map 1610 to generate a local low-level attribute feature vector Zlow.
Further, as shown in fig. 16, the scene-based image editing system 106 generates a multi-attention feature vector Zatt. As shown in fig. 16, scene-based image editing system 106 generates multi-attention feature vector Zatt from image-object feature map 1606 by using attention map 1620 of multi-attention neural network 1614. Indeed, in one or more embodiments, the scene-based image editing system 106 utilizes the multi-attention feature vector Zatt to focus on features at different spatial locations related to objects depicted within the digital image 1602, while predicting attribute tags of the depicted objects.
As further shown in fig. 16, the scene-based image editing system 106 utilizes the classifier neural network 1624 to predict the attribute tags 1626 when generating a local image-object feature vector Zrel, a local low-level attribute feature vector Zlow, and a multi-attention feature vector Zatt (collectively referred to as vectors 1622 in fig. 16). Specifically, in one or more embodiments, the scene-based image editing system 106 utilizes a classifier neural network 1624 having a cascade of local image-object feature vectors Zrel, local low-level attribute feature vectors Zlow, and multi-attention feature vectors Zatt to determine attribute tags 1626 for objects (e.g., chairs) depicted within the digital image 1602. As shown in fig. 16, the scene-based image editing system 106 determines positive attribute tags for chairs depicted in the digital image 1602, negative attribute tags that are not attributes of chairs depicted in the digital image 1602, and unknown attribute tags corresponding to attribute tags that the scene-based image editing system 106 cannot be confidently classified as belonging to chairs depicted in the digital image 1602 using the classifier neural network 1624.
In some cases, the scene-based image editing system 106 utilizes probabilities (e.g., probability scores, floating point probabilities) output by the classifier neural network 1624 for particular attributes to determine whether the attributes are classified as positive, negative, and/or unknown attribute tags for objects (e.g., chairs) depicted in the digital image 1602. For example, when the probability output of a particular attribute meets a positive attribute threshold (e.g., positive probability, probability greater than 0.5), the scene-based image editing system 106 identifies the attribute as a positive attribute. Further, when the probability output of a particular attribute meets a negative attribute threshold (e.g., negative probability, probability below-0.5), the scene-based image editing system 106 identifies the attribute as a negative attribute. Furthermore, in some cases, when the probability output of a particular attribute does not meet a positive attribute threshold or a negative attribute threshold, the scene-based image editing system 106 identifies the attribute as an unknown attribute.
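The positive/negative/unknown split can be expressed directly as a thresholding rule over the classifier's per-attribute scores. The sketch below simply mirrors the example threshold values mentioned above; the attribute scores are fabricated.

    def label_attributes(attribute_scores, positive_threshold=0.5, negative_threshold=-0.5):
        """Split per-attribute scores into positive, negative, and unknown labels."""
        positive, negative, unknown = [], [], []
        for attribute, score in attribute_scores.items():
            if score > positive_threshold:
                positive.append(attribute)
            elif score < negative_threshold:
                negative.append(attribute)
            else:
                unknown.append(attribute)
        return positive, negative, unknown

    # Fabricated classifier outputs for a depicted chair.
    scores = {"red": 0.92, "woody": 0.71, "blue": -0.83, "patterned": -0.61, "antique": 0.1}
    print(label_attributes(scores))
    # (['red', 'woody'], ['blue', 'patterned'], ['antique'])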
In some cases, the feature map includes height, width, and dimension positions (h×w×d) with D-dimensional feature vectors at each h×w image position. Further, in some embodiments, the feature vector includes a set of values representing characteristics and/or features of content (or objects) within the digital image. Indeed, in some embodiments, the feature vector includes a set of values corresponding to potential and/or salient attributes associated with the digital image. For example, in some cases, a feature vector is a multi-dimensional dataset representing features depicted within a digital image. In one or more embodiments, the feature vector includes a set of numerical metrics learned by a machine learning algorithm.
FIG. 17 illustrates an architecture of a multi-attribute contrast classification neural network in accordance with one or more embodiments. Indeed, in one or more embodiments, as shown in fig. 17, the scene-based image editing system 106 utilizes a multi-attribute contrast classification neural network with embedded neural network, locator neural network, multi-attention neural network, and classifier neural network components to determine positive and negative attribute tags (e.g., in terms of output attribute presence probabilities) for objects depicted in the digital image.
As shown in fig. 17, the scene-based image editing system 106 utilizes embedding neural networks within the multi-attribute contrast classification neural network. In particular, as shown in fig. 17, the scene-based image editing system 106 utilizes a low-level embedding layer 1704 (e.g., embedding $NN_l$ of the embedding neural network 1604 of FIG. 16) to generate a low-level attribute feature map 1710 from the digital image 1702. Further, as shown in fig. 17, the scene-based image editing system 106 utilizes a high-level embedding layer 1706 (e.g., embedding $NN_h$ of the embedding neural network 1604 of FIG. 16) to generate a high-level attribute feature map 1708 from the digital image 1702.
Specifically, in one or more embodiments, the scene-based image editing system 106 utilizes a convolutional neural network as the embedding neural network. For example, the scene-based image editing system 106 generates a D-dimensional image feature map $f_{img}(I) \in \mathbb{R}^{H \times W \times D}$ with spatial size $H \times W$ extracted from the convolutional-neural-network-based embedding neural network. In some cases, scene-based image editing system 106 utilizes the output of the penultimate layer of ResNet-50 as the image feature map $f_{img}(I)$.
As shown in fig. 17, the scene-based image editing system 106 utilizes the high-level embedding layer and the low-level embedding layer embedded in the neural network to extract both the high-level attribute feature map 1708 and the low-level attribute feature map 1710. By extracting both the high-level attribute feature map 1708 and the low-level attribute feature map 1710 of the digital image 1702, the scene-based image editing system 106 addresses the heterogeneity of features between different categories of attributes. In practice, attributes span a broad semantic level.
By utilizing both low-level and high-level feature maps, the scene-based image editing system 106 accurately predicts attributes over a broad semantic level range. For example, the scene-based image editing system 106 utilizes low-level feature maps to accurately predict properties such as, but not limited to, colors (e.g., red, blue, multicolor) of depicted objects, patterns (e.g., stripes, dashed lines, stripes), geometries (e.g., shapes, sizes, gestures), textures (e.g., rough, smooth, jagged), or materials (e.g., woody, metallic, shiny, matte). Meanwhile, in one or more embodiments, the scene-based image editing system 106 utilizes high-level feature maps to accurately predict properties such as, but not limited to, object states (e.g., broken, dry, messy, full, stale) or actions (e.g., running, sitting, flying) of the depicted objects.
Further, as shown in FIG. 17, the scene-based image editing system 106 generates an image-object feature map 1714. In particular, as shown in fig. 17, the scene-based image editing system 106 combines an object-tag embedding vector 1712 (e.g., the object-tag embedding vector 1608 of FIG. 16) from a tag corresponding to the object (e.g., "chair") with the high-level attribute feature map 1708 to generate the image-object feature map 1714 (e.g., the image-object feature map 1606 of FIG. 16). As further shown in fig. 17, the scene-based image editing system 106 utilizes a feature synthesis module (e.g., $f_{comp}$), which outputs the image-object feature map 1714 using the object-tag embedding vector 1712 and the high-level attribute feature map 1708.
In one or more embodiments, the scene-based image editing system 106 generates the image-object feature map 1714 to provide additional signals to the multi-attribute contrast classification neural network regarding the object for which it is predicting attributes (e.g., while also encoding features for that object). In particular, in some embodiments, the scene-based image editing system 106 incorporates the object-tag embedding vector 1712 (as input to the feature synthesis module $f_{comp}$ used to generate the image-object feature map 1714) to improve the classification results of the multi-attribute contrast classification neural network by enabling it to learn to avoid infeasible object-attribute combinations (e.g., parked dog, talking table, barking sofa). Indeed, in some embodiments, the scene-based image editing system 106 also utilizes the object-tag embedding vector 1712 (as input to the feature synthesis module $f_{comp}$) to enable the multi-attribute contrast classification neural network to learn to associate certain object-attribute pairs together (e.g., a ball is always round). In many cases, guiding the multi-attribute contrast classification neural network as to which object it is predicting attributes for enables it to focus on the particular visual aspects of that object. This in turn improves the quality of the extracted attributes of the depicted object.
In one or more embodiments, the scene-based image editing system 106 utilizes a feature synthesis module (e.g., $f_{comp}$) to generate the image-object feature map 1714. In particular, the scene-based image editing system 106 implements the feature synthesis module (e.g., $f_{comp}$) according to the following functions:

$f_{comp}(f_{img}(I), \phi_o) = f_{img}(I) \odot f_{gate}(\phi_o)$

and

$f_{gate}(\phi_o) = \sigma\big(W_{g2} \cdot \mathrm{ReLU}(W_{g1}\phi_o + b_{g1}) + b_{g2}\big)$

In the first function above, the scene-based image editing system 106 utilizes a channel-wise product ($\odot$) of the high-level attribute feature map $f_{img}(I)$ and the filter $f_{gate}$ of the object-tag embedding vector $\phi_o$ to generate the image-object feature map $X \in \mathbb{R}^{H \times W \times D}$.

Furthermore, in the second function above, the scene-based image editing system 106 uses a sigmoid function $\sigma$ within a 2-layer multi-layer perceptron (MLP), with broadcasting to match the spatial dimensions of the feature map. Indeed, in one or more embodiments, scene-based image editing system 106 uses $f_{gate}$ as a filter that selects the attribute features associated with the object of interest (e.g., as indicated by the object-tag embedding vector $\phi_o$). In many cases, scene-based image editing system 106 also utilizes $f_{gate}$ to suppress incompatible object-attribute pairs (e.g., talking table). In some embodiments, scene-based image editing system 106 identifies object-image tags for each object depicted within the digital image and, by utilizing the identified object-image tags with the multi-attribute contrast classification neural network, outputs attributes for each depicted object.
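A small NumPy sketch of the gating idea expressed by the two functions above is shown below, with made-up tensor shapes and randomly initialized weights; it is meant only to illustrate how a sigmoid-gated channel-wise product composes the feature map with the object-tag embedding, not to reproduce the trained module.

    import numpy as np

    rng = np.random.default_rng(0)
    H, W, D, E = 4, 4, 8, 16                   # spatial size, feature channels, embedding size

    feature_map = rng.normal(size=(H, W, D))   # high-level attribute feature map f_img(I)
    label_embedding = rng.normal(size=(E,))    # object-tag embedding vector phi_o

    # 2-layer MLP gate: f_gate(phi_o) = sigmoid(W_g2 @ relu(W_g1 @ phi_o + b_g1) + b_g2)
    W_g1, b_g1 = rng.normal(size=(32, E)), np.zeros(32)
    W_g2, b_g2 = rng.normal(size=(D, 32)), np.zeros(D)

    hidden = np.maximum(W_g1 @ label_embedding + b_g1, 0.0)
    gate = 1.0 / (1.0 + np.exp(-(W_g2 @ hidden + b_g2)))   # shape (D,), values in (0, 1)

    # Channel-wise product broadcasts the gate over the spatial positions.
    image_object_feature_map = feature_map * gate           # shape (H, W, D)
    print(image_object_feature_map.shape)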
Furthermore, as shown in FIG. 17, the scene-based image editing system 106 utilizes the image-object feature map 1714 with the locator neural network 1716 to generate the local image-object feature vector $Z_{rel}$ (e.g., also shown in FIG. 16 as the locator neural network 1612 and $Z_{rel}$). Specifically, as shown in FIG. 17, the scene-based image editing system 106 uses a convolutional layer $f_{rel}$ of the locator neural network 1716 to generate a local object attention feature vector 1717 (e.g., G in FIG. 16) reflecting the segmentation prediction of the depicted object. Then, as shown in FIG. 17, the scene-based image editing system 106 combines the local object attention feature vector 1717 with the image-object feature map 1714 to generate the local image-object feature vector $Z_{rel}$. As shown in fig. 17, the scene-based image editing system 106 utilizes a matrix multiplication 1720 between the local object attention feature vector 1717 and the image-object feature map 1714 to generate the local image-object feature vector $Z_{rel}$.
In some cases, the digital image may include multiple objects (and/or backgrounds). Thus, in one or more embodiments, the scene-based image editing system 106 utilizes a localizer neural network to learn improved feature aggregation that suppresses irrelevant object regions (e.g., regions that are not reflected in the segmentation predictions of the target object to isolate the target object). For example, referring to the digital image 1702, the scene-based image editing system 106 utilizes the locator neural network 1716 to locate object regions such that the multi-attribute contrast classification neural network predicts attributes of the correct object (e.g., the depicted chair) but not other unrelated objects (e.g., the depicted horse). To this end, in some embodiments, the scene-based image editing system 106 utilizes a locator neural network that utilizes supervised learning of object segmentation masks (e.g., object masks related to reference truth) from a dataset of label images (e.g., reference truth images as described below).
To illustrate, in some cases, the scene-based image editing system 106 utilizes a 2-stacked convolutional layer $f_{rel}$ (e.g., with kernel size 1), followed by a spatial softmax, to generate the local object attention feature vector G (e.g., the local object region) from the image-object feature map $X \in \mathbb{R}^{H \times W \times D}$ according to the following function:

$G = \mathrm{softmax}\big(f_{rel}(X)\big)$

For example, the local object attention feature vector G includes a single $H \times W$ data plane (e.g., a feature map having a single dimension). In some cases, the local object attention feature vector G includes a feature map (e.g., a local object attention feature map) that includes one or more feature vector dimensions.

Then, in one or more embodiments, scene-based image editing system 106 utilizes the local object attention feature vector $G_{h,w}$ and the image-object feature map $X_{h,w}$ to generate the local image-object feature vector $Z_{rel}$ according to the following function:

$Z_{rel} = \sum_{h=1}^{H} \sum_{w=1}^{W} G_{h,w} \, X_{h,w}$

In some cases, in the above function, scene-based image editing system 106 uses the local object attention feature vector $G_{h,w}$ to pool the $H \times W$ D-dimensional feature vectors $X_{h,w}$ (from the image-object feature map) in $\mathbb{R}^{H \times W \times D}$ into the single D-dimensional feature vector $Z_{rel}$.
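Continuing in the same spirit (with invented shapes), the spatial-softmax attention pooling that produces the local image-object feature vector could be sketched as follows.

    import numpy as np

    rng = np.random.default_rng(1)
    H, W, D = 4, 4, 8

    image_object_feature_map = rng.normal(size=(H, W, D))   # X
    attention_logits = rng.normal(size=(H, W))               # stand-in output of the f_rel layers

    # Spatial softmax turns the logits into a single H x W attention plane G.
    flat = attention_logits.reshape(-1)
    G = np.exp(flat - flat.max())
    G = (G / G.sum()).reshape(H, W)

    # Z_rel pools the H*W D-dimensional vectors X_{h,w} into one D-dimensional vector.
    Z_rel = np.einsum("hw,hwd->d", G, image_object_feature_map)
    print(Z_rel.shape)   # (8,)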
In one or more embodiments, referring to fig. 17, the scene-based image editing system 106 trains the locator neural network 1716 to learn the local object attention feature vector 1717 (e.g., G) with direct supervision of the object segmentation mask 1718 (e.g., the benchmark truth object segmentation mask 1618 from fig. 16).
Furthermore, as shown in FIG. 17, the scene-based image editing system 106 utilizes the image-object feature map 1714 with the multi-attention neural network 1722 to generate the multi-attention feature vector $Z_{att}$ (e.g., the multi-attention neural network 1614 and $Z_{att}$ of FIG. 16). In particular, as shown in FIG. 17, the scene-based image editing system 106 utilizes a convolutional layer $f_{att}$ (e.g., an attention layer) with the image-object feature map 1714 to extract attention maps 1724 (e.g., attention 1 through attention k) (e.g., the attention maps 1620 of FIG. 16). Then, as further shown in FIG. 17, the scene-based image editing system 106 passes the extracted attention maps 1724 (attention 1 through attention k) through a projection layer $f_{proj}$ to extract (e.g., via linear projection) the attention feature vectors used to generate the multi-attention feature vector $Z_{att}$.
In one or more embodiments, the scene-based image editing system 106 utilizes the multi-attention feature vector $Z_{att}$ to accurately predict the attributes of the depicted object within the digital image 1602 by providing focus on different parts of the depicted object and/or the area surrounding the depicted object (e.g., focusing on features at different spatial locations). To illustrate, in some cases, scene-based image editing system 106 utilizes the multi-attention feature vector $Z_{att}$ to extract attributes such as "barefoot" or "bald" by focusing on different parts of a person (i.e., object) depicted in the digital image. Likewise, in some embodiments, scene-based image editing system 106 utilizes the multi-attention feature vector $Z_{att}$ to distinguish between different activity attributes (e.g., jumping versus squatting) that may depend on information from the area surrounding the depicted object.
In some cases, scene-based image editing system 106 generates an attention map for each attribute depicted for an object within the digital image. For example, the scene-based image editing system 106 utilizes the image-object feature map with one or more attention layers to generate an attention map from the image-object feature map for each known attribute. The scene-based image editing system 106 then uses the attention maps with the projection layer to generate the multi-attention feature vector $Z_{att}$. In one or more embodiments, scene-based image editing system 106 generates various numbers of attention maps for the various attributes depicted for objects within a digital image (e.g., the system may generate an attention map for each attribute or a number of attention maps different from the number of attributes).
Furthermore, in one or more embodiments, the scene-based image editing system 106 utilizes a hybrid shared multi-attention approach that allows for attention hops while generating the attention maps from the image-object feature map. For example, the scene-based image editing system 106 may utilize a convolutional layer f_att (e.g., an attention layer) to extract M attention maps $A^{(m)} \in \mathbb{R}^{H \times W}$ from the image-object feature map X according to the following function: $\{A^{(m)}\}_{m=1}^{M} = f_{att}(X)$.
in some cases, scene-based image editing system 106 utilizes a convolutional layerHaving a 2-stack convolution layer f with the above function (3) rel A similar architecture. By utilizing the method outlined in the second function above, the scene-based image editing system 106 utilizes different ones corresponding to different ranges of attributesAttention is directed to the collection of force diagrams.
In one or more embodiments, the scene-based image editing system 106 then uses the M attention maps $\{A^{(m)}\}_{m=1}^{M}$ to aggregate M attention feature vectors $\{r^{(m)}\}_{m=1}^{M}$ from the image-object feature map X, e.g., $r^{(m)} = \sum_{h,w} A^{(m)}_{h,w} X_{h,w}$.
Furthermore, referring to FIG. 17, the scene-based image editing system 106 passes the M attention feature vectors $r^{(m)}$ through the projection layers $f_{proj}^{(m)}$ to extract the attention feature vectors $z^{(m)}$ according to the following function: $z^{(m)} = f_{proj}^{(m)}(r^{(m)})$.
Then, in one or more embodiments, the scene-based image editing system 106 concatenates the individual attention feature vectors $z^{(m)}$ according to the following function to generate the multi-attention feature vector Z_att: $Z_{att} = \mathrm{concat}(z^{(1)}, z^{(2)}, \ldots, z^{(M)})$.
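The multi-attention pipeline described above (attention maps, attention-weighted aggregation, per-hop projection, and concatenation) can be sketched as follows; the layer sizes, the softmax normalization, and the module structure are assumptions rather than the patent's exact architecture:

```python
import torch
import torch.nn as nn

class MultiAttentionSketch(nn.Module):
    """Illustrative sketch of the M-attention-hop aggregation described above."""
    def __init__(self, d: int = 512, m_hops: int = 8, d_proj: int = 128):
        super().__init__()
        # f_att: convolutional attention layer producing M attention maps (assumed 2-layer form)
        self.f_att = nn.Sequential(nn.Conv2d(d, d, 1), nn.ReLU(), nn.Conv2d(d, m_hops, 1))
        # f_proj: one linear projection per attention hop
        self.f_proj = nn.ModuleList([nn.Linear(d, d_proj) for _ in range(m_hops)])

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        """X: (D, H, W) image-object feature map for one image."""
        d, h, w = X.shape
        A = self.f_att(X.unsqueeze(0)).squeeze(0)          # (M, H, W) attention maps
        A = torch.softmax(A.view(A.shape[0], -1), dim=-1)  # normalize each map over H*W
        R = A @ X.view(d, -1).t()                          # (M, D) attention feature vectors r^(m)
        Z = [proj(r) for proj, r in zip(self.f_proj, R)]   # projected z^(m) per hop
        return torch.cat(Z, dim=-1)                        # multi-attention feature vector Z_att
```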
In some embodiments, the scene-based image editing system 106 utilizes a divergence loss with the multi-attention neural network in the M-attention-hop approach. In particular, the scene-based image editing system 106 utilizes a divergence loss $\mathcal{L}_{div}$ that encourages the attention maps to focus on different (or unique) regions of the digital image (from the image-object feature map). In some cases, the scene-based image editing system 106 promotes diversity between the attention features by minimizing the cosine similarity between the ($l_2$-normalized) attention weight vectors E of the attention features. For example, the scene-based image editing system 106 determines the divergence loss according to the following function: $\mathcal{L}_{div} = \sum_{m \neq n} \frac{\langle E^{(m)}, E^{(n)} \rangle}{\|E^{(m)}\|_2 \|E^{(n)}\|_2}$.
In one or more embodiments, the scene-based image editing system 106 utilizes the divergence loss $\mathcal{L}_{div}$ to learn parameters of the multi-attention neural network 1722 and/or of the multi-attribute contrastive classification neural network (as a whole).
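A minimal sketch of such a divergence penalty, assuming the attention weight vectors are stacked into a matrix E of shape (M, H*W) and that the loss averages the pairwise cosine similarities (the exact weighting may differ), is:

```python
import torch
import torch.nn.functional as F

def divergence_loss(E: torch.Tensor) -> torch.Tensor:
    """E: (M, H*W) attention weight vectors, one per attention hop."""
    E = F.normalize(E, p=2, dim=-1)                  # l2-normalize each attention weight vector
    sim = E @ E.t()                                  # (M, M) pairwise cosine similarities
    m = E.shape[0]
    off_diag = sim - torch.eye(m, device=E.device)   # drop self-similarity terms
    # Minimizing this pushes the M attention maps toward distinct image regions.
    return off_diag.sum() / (m * (m - 1))
```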
In addition, as shown in FIG. 17, the scene-based image editing system 106 also generates a local low-level attribute feature vector Z_low (e.g., the Z_low of FIG. 16). Indeed, as shown in FIG. 17, the scene-based image editing system 106 generates the local low-level attribute feature vector Z_low by combining the low-level attribute feature map 1710 and the local object attention feature vector 1717. For example, as shown in FIG. 17, the scene-based image editing system 106 utilizes matrix multiplication 1726 to combine the low-level attribute feature map 1710 and the local object attention feature vector 1717 to generate the local low-level attribute feature vector Z_low.
In one or more embodiments, by generating and utilizing the local low-level attribute feature vector Z_low, the scene-based image editing system 106 improves the accuracy of the low-level features (e.g., color, material) extracted for objects depicted in a digital image. In particular, in one or more embodiments, the scene-based image editing system 106 pools the low-level features (represented by the low-level attribute feature map from the low-level embedding layer) using the local object attention feature vector (e.g., from the locator neural network). Indeed, in one or more embodiments, the scene-based image editing system 106 constructs the local low-level attribute feature vector Z_low by pooling the low-level features of the low-level feature map using the local object attention feature vector.
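For illustration, this low-level pooling can be sketched in the same way as the earlier attention pooling; the normalization of the attention weights is an assumption:

```python
import torch

def pool_low_level_features(low_level_map: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
    """low_level_map: (H, W, D_low) low-level attribute feature map;
    G: (H, W) local object attention weights (the matrix multiplication in the figure)."""
    weights = G / (G.sum() + 1e-8)  # simple normalization; an assumption for this sketch
    return torch.einsum("hwd,hw->d", low_level_map, weights)  # local low-level vector Z_low
```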
As further shown in FIG. 17, the scene-based image editing system 106 utilizes the classifier neural network 1732 (f_classifier) (e.g., the classifier neural network 1624 of FIG. 16) with the local image-object feature vector Z_rel, the multi-attention feature vector Z_att, and the local low-level attribute feature vector Z_low to determine positive attribute labels 1728 and negative attribute labels 1730 for the object (e.g., "chair") depicted within the digital image 1702. In some embodiments, the scene-based image editing system 106 utilizes a concatenation of the local image-object feature vector Z_rel, the multi-attention feature vector Z_att, and the local low-level attribute feature vector Z_low as input to the classification layers of the classifier neural network 1732 (f_classifier). Then, as shown in FIG. 17, the classifier neural network 1732 (f_classifier) generates positive attribute labels 1728 (e.g., red, bright red, clean, giant, wooden) for the depicted object in the digital image 1702 and also generates negative attribute labels 1730 (e.g., blue, stuffed, patterned, multicolored).
In one or more embodiments, the scene-based image editing system 106 utilizes a classifier neural network that is a 2-layer MLP. In some cases, the scene-based image editing system 106 utilizes a classifier neural network that includes various numbers of hidden units and output logits followed by a sigmoid. In some embodiments, the classifier neural network is trained by the scene-based image editing system 106 to generate both positive attribute labels and negative attribute labels. Although one or more embodiments described herein utilize a 2-layer MLP, in some cases the scene-based image editing system 106 utilizes linear layers (e.g., within the classifier neural network, for f_gate, and for the image-object feature map).
Furthermore, in one or more embodiments, the scene-based image editing system 106 utilizes the local image-object feature vector Z_rel, the multi-attention feature vector Z_att, and the local low-level attribute feature vector Z_low in various combinations with the classifier neural network to extract attributes for objects depicted in a digital image. For example, in some cases, the scene-based image editing system 106 provides the local image-object feature vector Z_rel and the multi-attention feature vector Z_att to the classifier neural network to extract attributes for the depicted object. In some cases, as shown in FIG. 17, the scene-based image editing system 106 utilizes a concatenation of the local image-object feature vector Z_rel, the multi-attention feature vector Z_att, and the local low-level attribute feature vector Z_low with the classifier neural network.
In one or more embodiments, the scene-based image editing system 106 utilizes the classifier neural network 1732 to generate as output a predictive score corresponding to the attribute tags. For example, classifier neural network 1732 may generate a predictive score for one or more attribute tags (e.g., blue score of 0.04, red score of 0.9, orange score of 0.4). Then, in some cases, the scene-based image editing system 106 utilizes attribute tags corresponding to predictive scores that meet a threshold predictive score. Indeed, in one or more embodiments, the scene-based image editing system 106 selects various attribute tags (both positive and negative) by utilizing output predictive scores of attributes from the classifier neural network.
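A hedged sketch of this classification stage (the hidden size, the threshold, and the exact concatenation order are assumptions) might look like the following:

```python
import torch
import torch.nn as nn

class AttributeClassifierSketch(nn.Module):
    """Illustrative 2-layer MLP over the concatenated Z_rel, Z_att, and Z_low vectors."""
    def __init__(self, d_in: int, n_attributes: int, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_attributes))

    def forward(self, z_rel, z_att, z_low):
        logits = self.mlp(torch.cat([z_rel, z_att, z_low], dim=-1))
        return torch.sigmoid(logits)  # per-attribute prediction scores in [0, 1]

def select_attribute_labels(scores, labels, threshold: float = 0.5):
    """Keep attribute labels whose prediction score meets the threshold."""
    return [label for label, s in zip(labels, scores.tolist()) if s >= threshold]
```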
Although one or more embodiments herein describe a scene-based image editing system 106 that utilizes specific embedded neural networks, locator neural networks, multi-attention neural networks, and classifier neural networks, the scene-based image editing system 106 may utilize various types of neural networks for these components (e.g., CNNs, FCNs). Further, while one or more embodiments herein describe a scene-based image editing system 106 that combines various feature maps (and/or feature vectors) using matrix multiplication, in some embodiments, the scene-based image editing system 106 combines feature maps (and/or feature vectors) using various methods, such as, but not limited to, concatenation, multiplication, addition, and/or aggregation. For example, in some implementations, the scene-based image editing system 106 combines the local object attention feature vector and the image-object feature map by concatenating the local object attention feature vector and the image-object feature map to generate the local image-object feature vector.
Thus, in some cases, the scene-based image editing system 106 utilizes an attribute classification neural network (e.g., the multi-attribute contrastive classification neural network) to determine object attributes of objects depicted in a digital image or to otherwise determine attributes of depicted semantic regions. In some cases, the scene-based image editing system 106 adds the object attributes or other attributes determined for a digital image to the semantic scene graph of the digital image. In other words, the scene-based image editing system 106 utilizes the attribute classification neural network in generating the semantic scene graph of the digital image. However, in some implementations, the scene-based image editing system 106 stores the determined object attributes or other attributes in a separate storage location.
Further, in one or more embodiments, the scene-based image editing system 106 facilitates modifying object properties of objects depicted in a digital image by modifying one or more object properties in response to user input. Specifically, in some cases, the scene-based image editing system 106 modifies object properties using a machine learning model, such as an attribute modification neural network. FIG. 18 illustrates an attribute modification neural network used by the scene-based image editing system 106 to modify object attributes in accordance with one or more embodiments.
In one or more embodiments, the attribute modification neural network comprises a computer-implemented neural network that modifies a specified object attribute of an object (or other specified attribute of a semantic region). In particular, in some embodiments, the attribute modification neural network comprises a computer-implemented neural network that receives user input targeting an object attribute and indicating a change to that attribute, and modifies the object attribute in accordance with the indicated change. In some cases, the attribute modification neural network comprises a generative network.
As shown in fig. 18, scene-based image editing system 106 provides object 1802 (e.g., a digital image depicting object 1802) and modification inputs 1804a-1804b to object modification neural network 1806. In particular, FIG. 18 shows modification inputs 1804a-1804b that include inputs for an object property to be changed (e.g., black of object 1802) and inputs for a change to occur (e.g., changing the color of object 1802 to white).
As shown in fig. 18, the object modifying neural network 1806 generates a visual feature map 1810 from the object 1802 using the image encoder 1808. In addition, the object modification neural network 1806 generates text features 1814a-1814b from the modification inputs 1804a-1804b using the text encoder 1812. In particular, as shown in FIG. 18, the object modifying neural network 1806 generates a visual feature map 1810 and text features 1814a-1814b (labeled "visual-semantic embedding space" or "VSE space") within the joint embedding space 1816.
In one or more embodiments, the object modification neural network 1806 performs text-guided visual feature manipulation to ground the modification inputs 1804a-1804b in the visual feature map 1810 and to manipulate the corresponding regions of the visual feature map 1810 with the provided text features. For example, as shown in FIG. 18, the object modification neural network 1806 generates a manipulated visual feature map 1820 from the visual feature map 1810 and the text features 1814a-1814b using operation 1818 (e.g., a vector arithmetic operation).
As further shown in fig. 18, the object-modifying neural network 1806 also utilizes a fixed edge extractor 1822 to extract edges 1824 (boundaries) of the object 1802. In other words, the object modification neural network 1806 utilizes a fixed edge extractor 1822 to extract edges 1824 of the region to be modified.
Further, as shown, the object modification neural network 1806 utilizes a decoder 1826 to generate the modified object 1828. In particular, the decoder 1826 generates the modified object 1828 from the edges 1824 extracted from the object 1802 and the manipulated visual feature map 1820 generated from the object 1802 and the modification inputs 1804a-1804b.
In one or more embodiments, the scene-based image editing system 106 trains the object modification neural network 1806 to process open-vocabulary instructions and open-domain digital images. For example, in some cases, the scene-based image editing system 106 trains the object modification neural network 1806 on a large-scale image caption dataset to learn a generic visual-semantic embedding space. In some cases, the scene-based image editing system 106 utilizes a convolutional neural network and/or a long short-term memory network as the encoders of the object modification neural network 1806 to convert the digital image and the text input into visual and text features.
A more detailed description of the text-guided visual feature manipulation is provided below. As previously described, in one or more embodiments, the scene-based image editing system 106 utilizes the joint embedding space 1816 to manipulate the visual feature map 1810 with the text instructions of the modification inputs 1804a-1804b via vector arithmetic operations. When manipulating certain objects or object attributes, the object modification neural network 1806 aims to modify only specific regions while keeping other regions unchanged. Thus, the object modification neural network 1806 performs vector arithmetic operations between the visual feature map 1810 (e.g., represented as a spatial grid of visual feature vectors) and the text features 1814a-1814b (e.g., represented as text feature vectors).
For example, in some cases, the object modification neural network 1806 identifies the regions in the visual feature map 1810 to be manipulated (i.e., grounded by the modification inputs 1804a-1804b) on the spatial feature map. In some cases, the object modification neural network 1806 provides soft grounding for the text query via a weighted sum of the visual feature map 1810. In some cases, the object modification neural network 1806 uses the text features 1814a-1814b (denoted as t) to calculate weights over the visual feature map 1810 (e.g., the dot products between the text feature vector and the visual feature vectors). Using this approach, the object modification neural network 1806 provides a soft grounding map that roughly locates the corresponding region in the visual feature map 1810 associated with the text instruction.
In one or more embodiments, the object modification neural network 1806 uses the grounding map as location-adaptive coefficients to control the manipulation strength at different locations. In some cases, the object modification neural network 1806 utilizes a coefficient α to control the global manipulation strength, which enables a continuous transition between the source image and the manipulated image. In one or more embodiments, the scene-based image editing system 106 denotes the visual feature vector at spatial position (i, j) of the visual feature map as $v_{i,j}$, where i, j ∈ {0, 1, ..., 6}.
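As an illustration of the soft grounding described above (the shapes and the PyTorch formulation are assumptions), the grounding value at each spatial position can be computed as a dot product between the visual feature vector and the text feature:

```python
import torch

def soft_grounding_map(V: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """V: (D, 7, 7) visual feature map; t: (D,) text feature vector.
    Returns a (7, 7) map whose value at (i, j) is the dot product <v_ij, t>."""
    return torch.einsum("dij,d->ij", V, t)
```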
The scene-based image editing system 106 performs various types of operations with the object modification neural network 1806 via vector arithmetic operations weighted by the soft grounding map and the coefficient α. For example, in some cases, the scene-based image editing system 106 utilizes the object modification neural network 1806 to change an object attribute or a global attribute. The object modification neural network 1806 embeds the text features of a source concept (e.g., "black triangle") and a target concept (e.g., "white triangle"), denoted $t_1$ and $t_2$, respectively. The object modification neural network 1806 performs the manipulation of the image feature vector $v_{i,j}$ at position (i, j) as follows: $\tilde{v}_{i,j} = v_{i,j} + \alpha \langle v_{i,j}, t_1 \rangle (t_2 - t_1)$,
where i, j ∈ {0, 1, ..., 6} and $\tilde{v}_{i,j}$ is the manipulated visual feature vector at position (i, j) of the 7 × 7 feature map.
In one or more embodiments, the object modification neural network 1806 removes the source feature $t_1$ from, and adds the target feature $t_2$ to, each visual feature vector $v_{i,j}$. In addition, $\langle v_{i,j}, t_1 \rangle$, the value of the soft grounding map at position (i, j), is calculated as the dot product of the image feature vector and the source text feature. In other words, this value represents the projection of the visual embedding $v_{i,j}$ in the direction of the text embedding $t_1$. In some cases, the object modification neural network 1806 uses this value as a location-adaptive manipulation strength to control which regions in the image should be edited. In addition, the object modification neural network 1806 uses the coefficient α as a hyperparameter that controls the image-level manipulation strength. By smoothly increasing α, the object modification neural network 1806 achieves a smooth transition from the source attribute to the target attribute.
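A minimal sketch of this change operation, following the equation above (the function and variable names are illustrative), is:

```python
import torch

def change_attribute(V: torch.Tensor, t_src: torch.Tensor, t_tgt: torch.Tensor,
                     alpha: float = 1.0) -> torch.Tensor:
    """V: (D, 7, 7) visual feature map; t_src, t_tgt: (D,) source/target text features."""
    # Location-adaptive strength: the soft grounding value <v_ij, t_src> at each position.
    grounding = torch.einsum("dij,d->ij", V, t_src)
    # Replace the source direction with the target direction, scaled by alpha and the grounding.
    return V + alpha * grounding.unsqueeze(0) * (t_tgt - t_src).view(-1, 1, 1)
```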
In some implementations, the scene-based image editing system 106 utilizes the object modification neural network 1806 to remove concepts (e.g., object attributes, objects, or other visual elements) from a digital image (e.g., removing an accessory from a person). In some cases, the object modification neural network 1806 represents the semantic embedding of the concept to be removed as t. Thus, the object modification neural network 1806 performs the following removal operation: $\tilde{v}_{i,j} = v_{i,j} - \alpha \langle v_{i,j}, t \rangle t$.
Furthermore, in some embodiments, the scene-based image editing system 106 utilizes the object modification neural network 1806 to modify the degree to which an object attribute (or other attribute of a semantic region) appears (e.g., making a red apple less red or increasing the brightness of the digital image). In some cases, the object modification neural network 1806 controls the strength of the attribute via the hyperparameter α. By smoothly adjusting α, the object modification neural network 1806 gradually emphasizes or attenuates the degree to which the attribute appears, as follows: $\tilde{v}_{i,j} = v_{i,j} + \alpha \langle v_{i,j}, t \rangle t$.
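The removal and strength-adjustment operations can be sketched analogously (again, the names and the PyTorch formulation are assumptions):

```python
import torch

def remove_concept(V: torch.Tensor, t: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Subtract the concept direction t from each position, weighted by the grounding value."""
    grounding = torch.einsum("dij,d->ij", V, t)
    return V - alpha * grounding.unsqueeze(0) * t.view(-1, 1, 1)

def adjust_attribute_strength(V: torch.Tensor, t: torch.Tensor, alpha: float) -> torch.Tensor:
    """alpha > 0 emphasizes the attribute t; alpha < 0 attenuates it."""
    grounding = torch.einsum("dij,d->ij", V, t)
    return V + alpha * grounding.unsqueeze(0) * t.view(-1, 1, 1)
```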
After deriving the manipulated feature map, the object modification neural network 1806 utilizes the decoder 1826 (an image decoder) to generate the manipulated image (e.g., the modified object 1828). In one or more embodiments, the scene-based image editing system 106 trains the object modification neural network 1806 as described by F. Faghri et al. in "VSE++: Improving Visual-Semantic Embeddings with Hard Negatives," arXiv:1707.05612, 2017, which is incorporated herein by reference in its entirety. In some cases, the decoder 1826 takes a 1024 × 7 × 7 feature map as input and consists of 7 ResNet blocks with upsampling layers in between, generating 256 × 256 images. Further, in some cases, the scene-based image editing system 106 utilizes a discriminator, including a multi-scale patch-based discriminator. In some implementations, the scene-based image editing system 106 trains the decoder 1826 with a GAN loss, a perceptual loss, and a discriminator feature matching loss. Furthermore, in some embodiments, the fixed edge extractor 1822 includes a bi-directional cascade network.
19A-19C illustrate a graphical user interface implemented by the scene-based image editing system 106 to facilitate modifying object properties of objects depicted in a digital image in accordance with one or more embodiments. In fact, while FIGS. 19A-19C specifically illustrate modifying object properties of an object, it should be noted that in various embodiments, the scene-based image editing system 106 similarly modifies properties of other semantic regions (e.g., background, foreground, ground, sky, etc.) of a digital image.
In effect, as shown in FIG. 19A, the scene-based image editing system 106 provides a graphical user interface 1902 for display on a client device 1904 and provides a digital image 1906 for display within the graphical user interface 1902. As further shown, digital image 1906 depicts object 1908.
As further shown in FIG. 19A, in response to detecting a user interaction with object 1908, scene-based image editing system 106 provides properties menu 1910 for display within graphical user interface 1902. In some embodiments, the properties menu 1910 provides one or more object properties of the object 1908. In effect, FIG. 19A shows that the properties menu 1910 provides object property indicators 1912a-1912c that indicate the shape, color, and material, respectively, of the object 1908. However, it should be noted that various alternative or additional object properties are provided in various embodiments.
In one or more embodiments, scene-based image editing system 106 retrieves object properties for object property indicators 1912a-1912c from the semantic scene graph generated for digital image 1906. Indeed, in some implementations, scene-based image editing system 106 generates a semantic scene graph for digital image 1906 (e.g., before detecting user interaction with object 1908). In some cases, scene-based image editing system 106 utilizes an attribute classification neural network to determine object attributes of object 1908 and includes the determined object attributes in the semantic scene graph. In some implementations, the scene-based image editing system 106 retrieves object properties from separate storage locations.
As shown in FIG. 19B, the scene-based image editing system 106 detects a user interaction with the object attribute indicator 1912c. In fact, in one or more embodiments, the object attribute indicators 1912a-1912c are interactive. As shown, in response to detecting the user interaction, the scene-based image editing system 106 removes the corresponding object attribute of the object 1908 from the display. Further, in response to detecting the user interaction, the scene-based image editing system 106 provides a keyboard 1914 for display within the graphical user interface 1902. Thus, the scene-based image editing system 106 provides a prompt to enter text user input. In some cases, upon detecting the user interaction with the object attribute indicator 1912c, the scene-based image editing system 106 maintains the corresponding object attribute for display, allowing further user interaction to remove the object attribute upon confirming that the object attribute has become the modification target.
As shown in FIG. 19C, the scene-based image editing system 106 detects one or more user interactions with the keyboard 1914 displayed within the graphical user interface 1902. In particular, the scene-based image editing system 106 receives text user input provided via the keyboard 1914. The scene-based image editing system 106 also determines that the text user input provides a change to the object attribute corresponding to the object attribute indicator 1912c. Further, as shown, the scene-based image editing system 106 provides the text user input for display as part of the object attribute indicator 1912c.
In this case, the user interaction with the graphical user interface 1902 provides instructions to change the material of the object 1908 from a first material (e.g., wood) to a second material (e.g., metal). Thus, upon receiving the text user input regarding the second material, the scene-based image editing system 106 modifies the digital image 1906 by modifying the object attribute of the object 1908 to reflect the second material provided by the user.
In one or more embodiments, the scene-based image editing system 106 utilizes the attribute modification neural network to change the object attribute of the object 1908. In particular, as described above with reference to FIG. 18, the scene-based image editing system 106 provides the digital image 1906 and a modification input composed of the first material and the second material provided by the text user input to the attribute modification neural network. Thus, the scene-based image editing system 106 utilizes the attribute modification neural network to provide as output a modified digital image depicting the object 1908 with the modified object attribute.
FIGS. 20A-20C illustrate another graphical user interface implemented by the scene-based image editing system 106 to facilitate modifying object properties of objects depicted in a digital image in accordance with one or more embodiments. As shown in FIG. 20A, the scene-based image editing system 106 provides a digital image 2006 depicting an object 2008 for display within a graphical user interface 2002 of a client device 2004. Further, upon detecting a user interaction with the object 2008, the scene-based image editing system 106 provides a properties menu 2010 having object attribute indicators 2012a-2012c listing object properties of the object 2008.
As shown in fig. 20B, scene-based image editing system 106 detects additional user interactions with object property indicator 2012 a. In response to detecting the additional user interaction, the scene-based image editing system 106 provides an alternative properties menu 2014 for display within the graphical user interface 2002. In one or more embodiments, the alternative properties menu 2014 includes one or more options for changing the properties of the corresponding object. In effect, as shown in FIG. 20B, alternative properties menu 2014 includes alternative properties indicators 2016a-2016c that provide object properties that can be used in place of the current object properties of object 2008.
As shown in fig. 20C, scene based image editing system 106 detects user interaction with alternative attribute indicator 2016 b. Thus, scene-based image editing system 106 modifies digital image 2006 by modifying the object properties of object 2008 according to user input with alternative property indicator 2016 b. In particular, scene-based image editing system 106 modifies object 2008 to reflect alternative object properties associated with alternative property indicator 2016 b.
In one or more embodiments, scene-based image editing system 106 utilizes a textual representation of alternative object properties when modifying object 2008. For example, as described above, scene-based image editing system 106 provides a text representation as text input to an attribute-modifying neural network and utilizes the attribute-modifying neural network to output a modified digital image, where object 2008 reflects a target change in its object attributes.
FIGS. 21A-21C illustrate another graphical user interface implemented by the scene-based image editing system 106 to facilitate modifying object properties of objects depicted in a digital image in accordance with one or more embodiments. As shown in FIG. 21A, the scene-based image editing system 106 provides a digital image 2106 depicting an object 2108 for display within a graphical user interface 2102 of a client device 2104. In addition, upon detecting a user interaction with the object 2108, the scene-based image editing system 106 provides an attribute menu 2110 with object attribute indicators 2112a-2112c listing object attributes of the object 2108.
As shown in fig. 21B, scene-based image editing system 106 detects additional user interactions with object property indicator 2112B. In response to detecting the additional user interaction, scene-based image editing system 106 provides a slider bar 2114 for display within graphical user interface 2102. In one or more embodiments, the slider bar 2114 includes a slider bar element 2116 that indicates the extent to which the corresponding object attribute appears in the digital image 2106 (e.g., the intensity of its presence in the digital image 2106).
As shown in fig. 21C, the scene-based image editing system 106 detects a user interaction with the slider element 2116 of the slider 2114, thereby increasing the extent to which the corresponding object attribute appears in the digital image. Thus, the scene-based image editing system 106 modifies the digital image 2106 by modifying the object 2108 to reflect the increased intensity in the appearance of the corresponding object attribute.
In particular, in one or more embodiments, the scene-based image editing system 106 utilizes the attribute modification neural network to modify the digital image 2106 in accordance with the user interaction. Indeed, as described above with reference to FIG. 18, the scene-based image editing system 106 is able to strengthen or weaken the intensity with which an object attribute appears via the coefficient α. Thus, in one or more embodiments, the scene-based image editing system 106 adjusts the coefficient α based on the positioning of the slider element 2116 via the user interaction.
By facilitating image modification targeting specific object properties as described above, the scene-based image editing system 106 provides improved flexibility and efficiency over conventional systems. In effect, the scene-based image editing system 106 provides a flexible, intuitive method of visually displaying descriptions of properties of objects and allowing user input interacting with those descriptions to change properties. Thus, the scene-based image editing system 106 allows user interaction to target object properties at a high level of abstraction (e.g., without having to interact at the pixel level), rather than requiring cumbersome manual manipulation of object properties as is typical in many conventional systems. Furthermore, because scene-based image editing system 106 enables modification of object properties via relatively few user interactions with provided visual elements, scene-based image editing system 106 enables a graphical user interface that provides improved efficiency.
As previously described, in one or more embodiments, scene-based image editing system 106 also uses a semantic scene graph generated for digital images to implement relationship-aware object modification. In particular, the scene-based image editing system 106 utilizes the semantic scene graph to inform the modifying behavior of objects depicted in the digital image based on their relationship to one or more other objects in the digital image. FIGS. 22A-25D illustrate implementing a relationship-aware object modification in accordance with one or more embodiments.
In fact, many conventional systems are inflexible in that they require different objects to interact individually for modification. This is often the case even if different objects are to be similarly modified (e.g., similarly resized or moved). For example, conventional systems typically require execution of a separate workflow via user interaction to modify a separate object, or at least to perform a preparatory step of modification (e.g., outlining the object and/or separating the object from the rest of the image). Furthermore, conventional systems are often unable to adapt to relationships between objects in digital images when modifications are performed. In practice, these systems may modify a first object within a digital image, but cannot perform modifications to a second object based on the relationship between the two objects. Thus, the resulting modified image may appear unnatural or aesthetically confusing because it does not correctly reflect the relationship between the two objects.
Thus, conventional systems are also typically inefficient in that they require a significant amount of user interaction to modify the individual objects depicted in the digital image. Indeed, as noted above, conventional systems typically require execution of a separate workflow via user interaction to perform many of the steps required to modify a separate object. Thus, many user interactions are redundant in that they are received, processed, and responded to multiple times for individual objects. Furthermore, when an object having a relationship with another object is modified, conventional systems require additional user interaction to modify the other object according to the relationship. Thus, these systems unnecessarily repeat interactions used (e.g., interactions for moving objects and then moving related objects) to perform individual modifications to related objects, even if the relationship implies modifications to be performed.
The scene-based image editing system 106 provides greater flexibility and efficiency than conventional systems by enabling relational awareness object modification. Indeed, as will be discussed, the scene-based image editing system 106 provides a flexible, simplified process for selecting related objects for modification. Thus, the scene-based image editing system 106 flexibly allows user interaction to select and modify multiple objects depicted in a digital image via a single workflow. In addition, scene-based image editing system 106 facilitates intuitive modifications to related objects such that the resulting modified image continues to reflect the relationship. Thus, the digital image modified by the scene-based image editing system 106 provides a more natural look than conventional systems.
Furthermore, the scene-based image editing system 106 improves efficiency by implementing a simplified process for selecting and modifying related objects. In particular, scene-based image editing system 106 implements a graphical user interface that reduces the user interaction required to select and modify multiple related objects. Indeed, as will be discussed, the scene-based image editing system 106 handles a relatively small amount of user interaction with one object to predict, suggest, and/or perform modifications to other objects, thereby eliminating the need for additional user interaction for such modifications.
For example, FIGS. 22A-22D illustrate a graphical user interface implemented by the scene-based image editing system 106 to facilitate relationship-aware object modification in accordance with one or more embodiments. In effect, as shown in FIG. 22A, the scene-based image editing system 106 provides a digital image 2206 depicting objects 2208a-2208b and an object 2220 for display within the graphical user interface 2202 of the client device 2204. In particular, the digital image 2206 depicts a relationship between the objects 2208a-2208b, wherein the object 2208a is holding the object 2208b.
In one or more embodiments, scene-based image editing system 106 references a semantic scene graph previously generated for digital image 2206 to identify relationships between objects 2208a-2208 b. Indeed, as previously described, in some cases, scene-based image editing system 106 includes relationships between objects of digital images in a semantic scene graph generated for the digital images. For example, in one or more embodiments, scene-based image editing system 106 determines relationships between objects using a machine learning model, such as one of the models discussed above with reference to fig. 15 (e.g., clustering and subgraph suggestion generation models). Thus, the scene based image editing system 106 includes the determined relationships within the representation of the digital image in the semantic scene graph. Further, prior to receiving a user interaction to modify any of the objects 2208a-2208b, the scene-based image editing system 106 determines a relationship between the objects 2208a-2208b for inclusion in the semantic scene graph.
In effect, FIG. 22A illustrates a semantic scene graph component 2210 from the semantic scene graph of digital image 2206. In particular, semantic scene graph component 2210 includes node 2212a representing object 2208a and node 2212b representing object 2208 b. Further, semantic scene graph component 2210 includes relationship indicators 2214a-2214b associated with nodes 2212a-2212 b. Relationship indicators 2214a-2214b indicate relationships between objects 2208a-2208b, where object 2208a holds object 2208b, and object 2208b is in turn held by object 2208 a.
As further shown, the semantic scene graph component 2210 includes behavior indicators 2216a-2216b associated with the relationship indicator 2214b. The behavior indicators 2216a-2216b assign behaviors to the object 2208b based on the relationship of the object 2208b to the object 2208a. For example, the behavior indicator 2216a indicates that the object 2208b moves with the object 2208a because the object 2208b is held by the object 2208a. In other words, when the object 2208a is moved, the behavior indicator 2216a instructs the scene-based image editing system 106 to move the object 2208b (or at least suggest that the object 2208b be moved). In one or more embodiments, the scene-based image editing system 106 includes the behavior indicators 2216a-2216b within the semantic scene graph based on the behavior policy graph used in generating the semantic scene graph. Indeed, in some cases, the behaviors assigned to a "holding" relationship (or other relationship) vary based on the behavior policy graph used. Thus, in one or more embodiments, the scene-based image editing system 106 references the previously generated semantic scene graph to identify relationships between objects and the behaviors assigned based on these relationships.
It should be noted that the semantic scene graph component 2210 indicates that the behaviors of the behavior indicators 2216a-2216b are assigned to the object 2208b rather than to the object 2208a. Indeed, in one or more embodiments, the scene-based image editing system 106 assigns behaviors to objects based on their roles in the relationship. For example, while it may be appropriate to move a held object when moving the holding object, in some embodiments the scene-based image editing system 106 determines that the holding object does not have to be moved when moving the held object. Thus, in some implementations, the scene-based image editing system 106 assigns different behaviors to different objects in the same relationship.
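For illustration only, the following sketch (hypothetical class and field names; not the patent's storage format) shows one way such relationship edges and their role-specific behavior indicators could be encoded:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RelationshipEdge:
    """One directed relationship edge in a semantic scene graph, with the behaviors
    a behavior policy graph assigns to the subject object for that relationship."""
    subject_id: str                                   # object receiving the behaviors
    predicate: str                                    # e.g. "held by" or "holding"
    object_id: str                                    # the related object
    behaviors: List[str] = field(default_factory=list)

# Hypothetical encoding of the component in FIG. 22A: the behaviors attach to the
# held object (2208b), not to the holding object (2208a).
edges = [
    RelationshipEdge("object_2208a", "holding", "object_2208b"),
    RelationshipEdge("object_2208b", "held by", "object_2208a",
                     behaviors=["moves with", "deletes with"]),
]
```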
As shown in fig. 22B, scene-based image editing system 106 determines a user interaction to select object 2208 a. For example, scene-based image editing system 106 determines that the user interaction targets object 2208a for modification. As further shown, scene-based image editing system 106 provides visual indication 2218 for display to indicate selection of object 2208 a.
As shown in fig. 22C, in response to detecting a user interaction selecting object 2208a, scene-based image editing system 106 automatically selects object 2208b. For example, in one or more embodiments, upon detecting a user interaction selecting object 2208a, scene-based image editing system 106 references a semantic scene graph generated for digital image 2206 (e.g., semantic scene graph component 2210 corresponding to object 2208 a). Based on the information represented in the semantic scene graph, scene-based image editing system 106 determines that another object exists in digital image 2206 that has a relationship with object 2208 a. In effect, scene-based image editing system 106 determines that object 2208a holds object 2208b. Instead, scene-based image editing system 106 determines that object 2208b is held by object 2208 a.
Because objects 2208a-2208b have a relationship, scene-based image editing system 106 adds object 2208b to the selection. As shown in fig. 22C, scene-based image editing system 106 modifies visual indication 2218 of the selection to indicate that object 2208b has been added to the selection. Although FIG. 22C illustrates automatic selection of object 2208b, in some cases, scene-based image editing system 106 selects object 2208b based on the behavior assigned to object 2208b within the semantic scene graph according to the relationship of object 2208b to object 2208 a. Indeed, in some cases, the scene-based image editing system 106 specifies when a relationship between objects results in automatically selecting one object when the user selects another object (e.g., via a "select together" behavior). However, as shown in fig. 22C, in some cases, the scene-based image editing system 106 automatically selects the object 2208b by default.
In one or more embodiments, scene-based image editing system 106 presents object masks for object 2208a and object 2208b based on object 2208a and object 2208b being included in the selection. In effect, scene-based image editing system 106 presents pre-generated object masks to objects 2208a-2208b in anticipation of modifications to objects 2208a-2208 b. In some cases, scene-based image editing system 106 retrieves pre-generated object masks from a semantic scene graph of digital image 2206 or retrieves storage locations for pre-generated object masks. In either case, object masking is readily available after the objects 2208a-2208b are included in the selection and before modification input has been received.
As further shown in FIG. 22C, the scene-based image editing system 106 provides an options menu 2222 for display within the graphical user interface 2202. In one or more embodiments, the scene-based image editing system 106 determines that at least one of the modification options from the options menu 2222 will, if selected, apply to both objects 2208a-2208b. In particular, the scene-based image editing system 106 determines that a modification selected for the object 2208a will also apply to the object 2208b based on the behavior assigned to the object 2208b.
Indeed, in one or more embodiments, in addition to determining the relationship between objects 2208a-2208b, scene-based image editing system 106 references the semantic scene graph for digital image 2206 to determine the behavior that has been assigned based on the relationship. In particular, scene-based image editing system 106 references behavior indicators (e.g., behavior indicators 2216a-2216 b) associated with relationships between objects 2208a-2208b to determine which behaviors are assigned to objects 2208a-2208b based on their relationships. Thus, by determining the behavior assigned to object 2208b, scene-based image editing system 106 determines how to respond to the potential edits.
For example, as shown in FIG. 22D, scene-based image editing system 106 deletes objects 2208a-2208b together. For example, in some cases, in response to detecting a selection of option 2224 presented within option menu 2222, scene-based image editing system 106 deletes objects 2208a-2208b. Thus, while object 2208a is targeted for deletion via user interaction, scene-based image editing system 106 includes object 2208b in a delete operation based on the behavior assigned to object 2208b via the semantic scene graph (i.e., the "delete together" behavior). Thus, in some embodiments, the scene-based image editing system 106 implements relationship-aware object modification by deleting objects based on their relationship to other objects.
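Building on the hypothetical edge encoding sketched earlier, the following helper illustrates how assigned behaviors such as "deletes with" or "moves with" could determine which related objects join a requested edit; the function and behavior names are assumptions, not the patented implementation:

```python
def objects_to_include(selected_id: str, edit: str, edges: list) -> list:
    """edges: (subject_id, predicate, object_id, behaviors) tuples. Returns the
    subjects whose assigned behaviors tie them to the selected object for the
    requested edit ("delete" or "move")."""
    behavior_for_edit = {"delete": "deletes with", "move": "moves with"}
    required = behavior_for_edit.get(edit)
    return [subject for subject, _predicate, other, behaviors in edges
            if other == selected_id and required in behaviors]

# With the FIG. 22 example: deleting object_2208a also pulls in object_2208b,
# while the unrelated object_2220 is untouched.
edges = [("object_2208a", "holding", "object_2208b", []),
         ("object_2208b", "held by", "object_2208a", ["moves with", "deletes with"])]
assert objects_to_include("object_2208a", "delete", edges) == ["object_2208b"]
```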
As previously described, in some implementations, the scene-based image editing system 106 only adds an object to a selection if its assigned behavior specifies that the object should be selected along with another object. In at least some cases, even where an assigned behavior specifies that an object should be selected with another object, the scene-based image editing system 106 adds the object just prior to acting on a received modification input. Indeed, in some cases, based on the behavior assigned to a second object, only a subset of the potential edits to a first object are applicable to the second object. Thus, if there is no behavior providing for automatic selection, including the second object in the selection of the first object prior to receiving the modification input risks violating the rules set forth by the behavior policy graph via the semantic scene graph. To avoid this risk, in some implementations, the scene-based image editing system 106 waits until a modification input has been received before determining whether to add the second object to the selection. However, in one or more embodiments, as shown in FIGS. 22A-22D, the scene-based image editing system 106 automatically adds the second object upon detecting a selection of the first object. In such embodiments, the scene-based image editing system 106 deselects the second object upon determining that the modification to the first object is not applicable to the second object based on the behavior assigned to the second object.
As further shown in fig. 22D, object 2220 remains in digital image 2206. In effect, scene-based image editing system 106 does not add object 2220 to the selection in response to user interaction with object 2208a, nor does object 2220 delete with objects 2208a-2208 b. For example, upon referencing the semantic scene graph of digital image 2206, scene-based image editing system 106 determines that there is no relationship (at least, no relationship applied to the scene) between object 2220 and any of objects 2208a-2208 b. Thus, the scene-based image editing system 106 enables user interactions to modify related objects together while preventing non-related objects from being modified without more targeted user interactions.
Further, as shown in FIG. 22D, when objects 2208a-2208b are removed, scene-based image editing system 106 exposes content fill 2226 within digital image 2206. In particular, upon deleting an object 2208a-2208b, the scene-based image editing system 106 exposes content fills previously generated for the object 2208a and content fills previously generated for the object 2208 b. Thus, the scene-based image editing system 106 facilitates seamless modification of the digital image 2206 as if it were a real scene.
23A-23C illustrate another graphical user interface implemented by the scene-based image editing system 106 to facilitate relational aware object modification in accordance with one or more embodiments. In effect, as shown in FIG. 23A, the scene-based image editing system 106 provides digital images 2306 depicting objects 2308a-2308b and objects 2320 for display within a graphical user interface 2302 of a client device 2304. In particular, the digital image 2306 depicts the relationship between objects 2308a-2308b, where object 2308a holds object 2308b.
As further shown in FIG. 23A, the scene-based image editing system 106 detects a user interaction selecting the object 2308a. In response to detecting the user interaction, the scene-based image editing system 106 provides a suggestion to add the object 2308b to the selection. In particular, the scene-based image editing system 106 provides a text box 2310 asking the user whether he or she wishes to add the object 2308b to the selection, an option 2312 for agreeing to add the object 2308b, and an option 2314 for declining to add the object 2308b.
In one or more embodiments, the scene-based image editing system 106 provides suggestions for adding objects 2308b to a selection based on determining relationships between objects 2308a-2308b via a semantic scene graph generated for the digital image 2306. In some cases, scene-based image editing system 106 also provides suggestions for adding object 2308b according to the behavior assigned to object 2308b based on the relationship.
As shown in fig. 23A, scene-based image editing system 106 does not suggest adding object 2320 to the selection. Indeed, in one or more embodiments, based on the reference semantic scene graph, scene-based image editing system 106 determines that there is no relationship (at least no correlation) between object 2320 and any of objects 2308a-2308 b. Thus, scene based image editing system 106 determines to omit object 2320 from the suggestion.
As shown in fig. 23B, scene-based image editing system 106 adds object 2308B to the selection. In particular, in response to receiving a user interaction with option 2312 to agree to add object 2308b, scene-based image editing system 106 adds object 2308b to the selection. As shown in fig. 23B, scene-based image editing system 106 modifies visual indication 2316 of the selection to indicate that object 2308B is added to the selection along with object 2308 a.
As shown in fig. 23C, in response to detecting one or more additional user interactions, scene-based image editing system 106 modifies digital image 2306 by moving object 2308a within digital image 2306. Further, scene-based image editing system 106 moves object 2308b along with object 2308a based on including object 2308b in the selection. Thus, the scene-based image editing system 106 implements relationship-aware object modification by moving objects based on their relationships to other objects.
24A-24C illustrate yet another graphical user interface implemented by the scene-based image editing system 106 to facilitate relational awareness object modification in accordance with one or more embodiments. In effect, as shown in FIG. 24A, scene-based image editing system 106 provides digital image 2406 depicting objects 2408a-2408b and object 2420 for display within graphical user interface 2402 of client device 2404. Specifically, digital image 2406 depicts a relationship between objects 2408a-2408b because object 2408a holds object 2408b.
As shown in fig. 24A, scene-based image editing system 106 detects a user interaction with object 2408 a. In response to detecting the user interaction, the scene-based image editing system 106 provides an options menu 2410 for display within the graphical user interface 2402. As shown, the options menu 2410 includes an option 2412 for deleting the object 2408 a.
As shown in FIG. 24B, the scene-based image editing system 106 detects an additional user interaction with the option 2412 for deleting the object 2408a. In response to detecting the additional user interaction, the scene-based image editing system 106 provides, for display, a suggestion for adding the object 2408b to the selection via a text box 2414 asking the user whether the user wishes to add the object 2408b to the selection, an option 2416 for agreeing to add the object 2408b, and an option 2418 for declining to add the object 2408b.
Indeed, as described above, in one or more embodiments, the scene-based image editing system 106 waits to receive input to modify the first object before suggesting the addition of the second object (or automatically adding the second object). Thus, the scene-based image editing system 106 determines whether the relationship between the object and the pending modification indicates that the second object should be added before including the second object in the selection.
To illustrate, in one or more embodiments, upon detecting the additional user interaction with the option 2412, the scene-based image editing system 106 references the semantic scene graph of the digital image 2406. Upon referencing the semantic scene graph, the scene-based image editing system 106 determines that the object 2408a has a relationship with the object 2408b. Further, the scene-based image editing system 106 determines that the behavior assigned to the object 2408b based on that relationship indicates that the object 2408b should be deleted along with the object 2408a. Thus, upon receiving the additional user interaction for deleting the object 2408a, the scene-based image editing system 106 determines that the object 2408b should also be deleted, and then provides the suggestion to add the object 2408b (or automatically adds the object 2408b) to the selection.
As shown in fig. 24C, scene-based image editing system 106 deletes object 2408a and object 2408b together from digital image 2406. Specifically, in response to detecting a user interaction with the option 2416 for adding the object 2408b to the selection, the scene-based image editing system 106 adds the object 2408b and performs a delete operation. In one or more embodiments, upon detecting a user interaction with the option 2418 to reject the add object 2408b, the scene-based image editing system 106 omits the object 2408b from the selection and deletes only the object 2408a.
Although moving objects or deleting objects based on their relationships to other objects is specifically discussed above, it should be noted that, in various embodiments, the scene-based image editing system 106 implements various other types of relationship-aware object modifications, such as resizing, recoloring, retexturing, or compositing modifications. Further, as previously described, in some embodiments, the behavior policies used by the scene-based image editing system 106 are configurable. Thus, in some implementations, the relationship-aware object modifications implemented by the scene-based image editing system 106 change based on user preferences.
In one or more embodiments, in addition to modifying objects based on relationships described within the behavior policy graphs incorporated into the semantic scene graph, scene-based image editing system 106 modifies objects based on classification relationships. In particular, in some embodiments, the scene-based image editing system 106 modifies objects based on relationships described by the real-world category description map incorporated into the semantic scene graph. In fact, as previously described, the real world category description map provides a hierarchy of object classifications for objects that may be depicted in a digital image. Thus, in some implementations, the scene-based image editing system 106 modifies objects within the digital image via their respective object classification hierarchies based on their relationships to other objects. For example, in one or more embodiments, scene-based image editing system 106 adds objects to the selection for modification via its respective object classification hierarchy based on the relationships of the objects to other objects. FIGS. 25A-25D illustrate a graphical user interface implemented by the scene-based image editing system 106 to add objects to a selection for modification based on a classification relationship in accordance with one or more embodiments.
In particular, FIG. 25A illustrates that the scene-based image editing system 106 provides a digital image 2506 depicting a plurality of objects 2508a-2508g for display in a graphical user interface 2502 of a client device 2504. Specifically, as shown, the objects 2508a-2508g include various items, such as shoes, eyeglasses, and jackets.
FIG. 25A also shows semantic scene graph components 2510a-2510c from the semantic scene graph of digital image 2506. In effect, the semantic scene graph components 2510a-2510c include portions of a semantic scene graph that provide an object classification hierarchy for each of the objects 2508a-2508 g. Alternatively, in some cases, semantic scene graph components 2510a-2510c represent portions of a real world category description graph used to make a semantic scene graph.
As shown in fig. 25A, semantic scene graph component 2510a includes a node 2512 representing a clothing category, a node 2514 representing an accessory category, and a node 2516 representing a shoe category. As further shown, the accessory category is a subcategory of the apparel category, and the footwear category is a subcategory of the accessory category. Similarly, the semantic scene graph component 2510b includes a node 2518 representing a clothing category, a node 2520 representing an accessory category, and a node 2522 representing a glasses category as a sub-category of the accessory category. Further, the semantic scene graph component 2510c includes a node 2524 representing a clothing category and a node 2526 representing an overcoat category as another sub-category of the clothing category. Thus, semantic scene graph components 2510a-2510c provide various classifications that apply to each of the objects 2508a-2508 g. In particular, semantic scene graph component 2510a provides a hierarchy of object classifications associated with shoes presented in digital image 2506, semantic scene graph component 2510b provides a hierarchy of object classifications associated with glasses, and semantic scene graph component 2510c provides a hierarchy of object classifications associated with jackets.
As shown in fig. 25B, scene-based image editing system 106 detects a user interaction selecting object 2508 e. In addition, scene-based image editing system 106 detects a user interaction selecting object 2508 b. As further shown, in response to detecting the selection of object 2508b and object 2508e, scene-based image editing system 106 provides a text box 2528, which text box 2528 suggests adding all shoes in digital image 2506 to the selection.
To illustrate, in some embodiments, in response to detecting selection of object 2508b and object 2508e, scene-based image editing system 106 references a semantic scene graph (e.g., semantic scene graph components associated with object 2508b and object 2508 e) generated for digital image 2506. Based on the reference semantic scene graph, scene-based image editing system 106 determines that both object 2508b and object 2508e are part of a footwear category. Thus, scene-based image editing system 106 determines, via the footwear category, that a classification relationship exists between object 2508b and object 2508 e. In one or more embodiments, based on determining that both object 2508b and object 2508e are part of a footwear category, scene-based image editing system 106 determines to provide a user interaction that selects to target all of the shoes within digital image 2506. Accordingly, scene-based image editing system 106 provides a text box 2528, which text box 2528 suggests adding other shoes to the selection. In one or more embodiments, upon receiving a user interaction accepting the suggestion, the scene-based image editing system 106 adds other shoes to the selection.
Similarly, as shown in FIG. 25C, scene-based image editing system 106 detects a user interaction selecting object 2508C and another user interaction selecting object 2508 b. In response to detecting the user interaction, scene-based image editing system 106 references the semantic scene graph generated for digital image 2506. Based on the reference semantic scene graph, the scene-based image editing system 106 determines that the object 2508b is part of a footwear category that is a sub-category of the accessory category. In other words, the scene-based image editing system 106 determines that the object 2508b is part of a accessory category. Likewise, the scene-based image editing system 106 determines that the object 2508c is part of a glasses category that is a sub-category of an accessory category. Thus, the scene-based image editing system 106 determines that a classification relationship exists between the object 2508b and the object 2508c via the accessory category. As shown in fig. 25C, based on determining that both object 2508b and object 2508C are part of an accessory category, scene-based image editing system 106 provides a text box 2530 suggesting that all other accessories depicted in digital image 2506 (e.g., other shoes and glasses) be added to the selection.
In addition, as shown in FIG. 25D, scene-based image editing system 106 detects a user interaction that selects object 2508a and another user interaction that selects object 2508 b. In response to detecting the user interaction, scene-based image editing system 106 references the semantic scene graph generated for digital image 2506. Based on the reference semantic scene graph, the scene-based image editing system 106 determines that the object 2508b is part of a footwear category that is a sub-category of an accessory category that is a sub-category of a clothing category. Similarly, the scene-based image editing system 106 determines that the object 2508a is part of an overcoat category, which is also a subcategory of a clothing category. Thus, the scene-based image editing system 106 determines that a classification relationship exists between the object 2508b and the object 2508a via the clothing class. As shown in fig. 25D, based on determining that both object 2508b and object 2508a are part of a clothing category, scene-based image editing system 106 provides a text box 2532, which text box 2532 suggests that all other clothing items depicted in digital image 2506 be added to the selection.
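To illustrate one way the selection suggestions described above could be derived from an object classification hierarchy, the following Python sketch walks each selected object's ancestor categories and suggests every unselected object that shares the most specific common category. The data structures, names, and return values are hypothetical illustrations and not a description of a particular implementation of the scene-based image editing system 106.

```python
# Hypothetical hierarchy: each category maps to its parent category (None = root),
# and object_categories maps each detected object to its most specific class.
HIERARCHY = {"shoe": "accessory", "glasses": "accessory",
             "accessory": "clothing", "coat": "clothing", "clothing": None}

def ancestor_categories(category):
    """Return the classification chain, most specific first (e.g., shoe, accessory, clothing)."""
    chain = []
    while category is not None:
        chain.append(category)
        category = HIERARCHY.get(category)
    return chain

def suggest_related_objects(object_categories, selected_ids):
    """Find the most specific category shared by the selected objects and
    suggest every unselected object that also falls under that category."""
    shared = None
    for object_id in selected_ids:
        chain = set(ancestor_categories(object_categories[object_id]))
        shared = chain if shared is None else shared & chain
    if not shared:
        return None, []
    # Walk one selected object's chain so the most specific shared category wins.
    common = next(c for c in ancestor_categories(object_categories[selected_ids[0]])
                  if c in shared)
    suggestions = [obj for obj, cat in object_categories.items()
                   if obj not in selected_ids and common in ancestor_categories(cat)]
    return common, suggestions

# Example: selecting a shoe and glasses yields the accessory category, so the
# remaining shoe is suggested; the coat is not, since it is not an accessory.
objects = {"2508b": "shoe", "2508e": "shoe", "2508c": "glasses", "2508a": "coat"}
print(suggest_related_objects(objects, ["2508b", "2508c"]))  # ('accessory', ['2508e'])
```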
Thus, in one or more embodiments, scene-based image editing system 106 predicts objects that are targeted by user interactions and facilitates faster selection of those objects based on their classification relationships. In some embodiments, upon selection of multiple objects via the provided suggestions, the scene-based image editing system 106 modifies the selected objects in response to additional user interactions. In effect, scene-based image editing system 106 modifies the selected objects together. Thus, the scene-based image editing system 106 implements a graphical user interface that provides a more flexible and efficient method of selecting and modifying multiple related objects using reduced user interactions.
Indeed, as previously described, the scene-based image editing system 106 provides improved flexibility and efficiency over conventional systems. For example, by selecting objects based on selection of related objects (e.g., automatically or via suggestions), the scene-based image editing system 106 provides a flexible method of targeting multiple objects for modification. In effect, the scene-based image editing system 106 flexibly identifies related objects and includes them in the selection. Thus, the scene-based image editing system 106 implements a graphical user interface that reduces user interaction typically required in conventional systems for selecting and modifying multiple objects.
In one or more embodiments, the scene-based image editing system 106 also pre-processes the digital image to help remove interfering objects. In particular, the scene-based image editing system 106 utilizes machine learning to identify objects in a digital image, classify one or more of the objects as interfering objects, and facilitate removal of the interfering objects to provide a more visually consistent and aesthetically pleasing resulting image. Furthermore, in some cases, scene-based image editing system 106 utilizes machine learning to facilitate removing shadows associated with interfering objects. FIGS. 26-39C illustrate schematic diagrams of the scene-based image editing system 106 identifying and removing interfering objects and their shadows from digital images in accordance with one or more embodiments.
Many conventional systems lack flexibility in their methods for removing interfering humans because they take control out of the user's hands. For example, conventional systems typically remove, automatically and without user input, any person they classify as a distraction. Thus, upon receiving a digital image, such systems provide no opportunity for user interactions to inform the removal process. For example, these systems do not allow user interactions to remove a person from the set of people identified for removal.
Furthermore, conventional systems often do not flexibly remove all types of interfering objects. For example, many conventional systems are not flexible in removing shadows cast by interfering objects and non-human objects. Indeed, while some existing systems identify and remove interfering people in digital images, these systems often fail to identify shadows cast by people or other objects in digital images. Thus, the resulting digital image will still include the effects of the interfering human being, as its shadows remain despite the removal of the interfering human being itself. This further results in these conventional systems requiring additional user interaction to identify and remove these shadows.
Scene-based image editing system 106 solves these problems by providing more user control during the removal process while reducing the interactions typically required to delete objects from the digital image. Indeed, as will be explained below, the scene-based image editing system 106 presents the identified interfering objects as a set of objects selected for removal for display. The scene-based image editing system 106 also enables user interactions to add objects to the collection, remove objects from the collection, and/or determine when to delete selected objects. Thus, the scene-based image editing system 106 employs flexible workflows to remove interfering objects based on machine learning and user interactions.
Furthermore, the scene-based image editing system 106 flexibly identifies and removes shadows associated with interfering objects within the digital image. By removing shadows associated with the interfering objects, the scene-based image editing system 106 provides better image results because additional aspects of the interfering objects and their effects within the digital image are removed. This allows for reduced user interaction compared to conventional systems, as the scene-based image editing system 106 does not require additional user interaction to identify and remove shadows.
FIG. 26 illustrates a neural network pipeline used by the scene-based image editing system 106 to identify and remove interfering objects from digital images in accordance with one or more embodiments. In effect, as shown in FIG. 26, scene-based image editing system 106 receives digital image 2602 depicting a plurality of objects. As shown, the scene-based image editing system 106 provides digital images 2602 to a neural network pipeline that includes a segmentation neural network 2604, an interferent detection neural network 2606, a shadow detection neural network 2608, and a repair neural network 2610.
In one or more embodiments, the scene-based image editing system 106 utilizes one of the segmented neural networks described above (e.g., the detection masking neural network 300 discussed with reference to fig. 3) as the segmented neural network 2604. In some embodiments, scene-based image editing system 106 utilizes one of the content-aware machine learning models discussed above as repair neural network 2610 (e.g., cascade modulation repair neural network 420 discussed with reference to fig. 4). The interferent detection neural network 2606 and the shadow detection neural network 2608 will be discussed in more detail below.
As shown in fig. 26, scene-based image editing system 106 utilizes a neural network pipeline to generate modified digital image 2612 from digital image 2602. In particular, the scene-based image editing system 106 utilizes the neural network pipeline to identify and remove interfering objects from the digital image 2602. Specifically, the scene-based image editing system 106 generates object masks for objects in the digital image using the segmented neural network 2604. The scene-based image editing system 106 utilizes the interferent detection neural network 2606 to determine classifications for objects of the plurality of objects. More specifically, the scene-based image editing system 106 assigns a classification of subject object or interfering object to each object. The scene-based image editing system 106 removes interfering objects from the digital image using the object masks. In addition, the scene-based image editing system 106 utilizes the repair neural network 2610 to generate a content fill for the portion of the digital image 2602 from which the interfering objects are removed to generate a modified digital image 2612. As shown, the scene-based image editing system 106 deletes a variety of different types of interfering objects (multiple men and a pole). In practice, the scene-based image editing system 106 is robust enough to identify non-human objects as interfering objects (e.g., the pole behind the girl).
In one or more embodiments, the scene-based image editing system 106 utilizes a subset of the neural networks shown in fig. 26 to generate a modified digital image. For example, in some cases, scene-based image editing system 106 utilizes segmented neural network 2604, interferent detection neural network 2606, and content fill 210 to generate a modified digital image from the digital image. Furthermore, in some cases, scene-based image editing system 106 utilizes a different order of the neural networks than shown.
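The overall data flow of the pipeline shown in fig. 26 can be summarized with a short Python sketch. The model objects and their method names (detect, classify, predict_shadow, fill) are hypothetical placeholders for whichever segmentation, interferent detection, shadow detection, and repair networks are used; only the ordering of steps follows the description above.

```python
import numpy as np

def combine_masks(masks):
    """Pixelwise union of equally sized binary masks."""
    return np.clip(np.sum(masks, axis=0), 0, 1).astype(np.uint8)

def remove_interfering_objects(image, segmentation_net, interferent_net,
                               shadow_net, repair_net):
    """Identify interfering objects (and their shadows) and fill the removed regions."""
    # 1. Segment every object and build a per-object mask.
    detections = segmentation_net.detect(image)      # e.g., list of {"box": ..., "mask": ...}

    # 2. Classify each detection as a subject object or an interfering object.
    distractors = [d for d in detections
                   if interferent_net.classify(image, d["box"]) == "interfering"]
    if not distractors:
        return image

    # 3. Associate each interfering object with the shadow it casts.
    for d in distractors:
        d["shadow_mask"] = shadow_net.predict_shadow(image, d["mask"], detections)

    # 4. Remove the interfering objects and their shadows, then repair the holes.
    removal_mask = combine_masks([d["mask"] for d in distractors] +
                                 [d["shadow_mask"] for d in distractors])
    return repair_net.fill(image, removal_mask)
```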
Fig. 27 illustrates an architecture of an interferent detection neural network 2700 that is used by the scene-based image editing system 106 to identify and classify interfering objects in digital images in accordance with one or more embodiments. As shown in fig. 27, the interferent detection neural network 2700 includes a heat map network 2702 and an interferent classifier 2704.
As shown, a heat map network 2702 operates on an input image 2706 to generate a heat map 2708. For example, in some cases, heat map network 2702 generates a subject heat map representing potential subject objects and an interferent heat map representing potential interfering objects. In one or more embodiments, the heat map (also referred to as a class activation map) includes predictions made by a convolutional neural network that indicate, on a scale of 0 to 1, the probability that a particular pixel of an image belongs to a particular class in a set of classes. In some cases, in contrast to object detection, the goal of a heat map network is to classify individual pixels as belonging to the same region. In some cases, a region includes an area in which all pixels in the digital image have the same color or brightness.
In at least one implementation, scene-based image editing system 106 trains heat map network 2702 on whole digital images, including digital images without interfering objects and digital images depicting both subject objects and interfering objects.
In one or more embodiments, the heat map network 2702 identifies features in the digital image, such as body pose and orientation, that help determine whether a given region is more likely to be an interfering object or more likely to be a subject object. For example, in some cases, heat map network 2702 determines that an object captured in a casual, unposed stance rather than deliberately posing is likely to be an interfering object, and also determines that an object facing away from the camera is likely to be an interfering object. In some cases, the heat map network 2702 considers other features, such as size, intensity, color, and the like.
In some embodiments, heat map network 2702 classifies regions of input image 2706 as primary objects or interferents and outputs heat map 2708 based on the classification. For example, in some embodiments, heat map network 2702 represents any pixels determined to be part of a subject object as white within a subject heat map and any pixels determined to not be part of a subject object as black (or vice versa). Similarly, in some cases, heat map network 2702 represents any pixels determined to be part of an interfering object as white within the interferent heat map, and any pixels determined not to be part of an interfering object as black (or vice versa).
In some implementations, heat map network 2702 also generates a background heat map representing the potential background as part of heat map 2708. For example, in some cases, heat map network 2702 determines that the background includes an area that is not part of the subject object or the interfering object. In some cases, heat map network 2702 represents any pixels within the background heat map that are determined to be part of the background as white, and any pixels that are determined to be not part of the background as black (or vice versa).
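As a rough illustration of a decoder head that could produce the subject, interferent, and background heat maps described above, the following PyTorch sketch maps backbone features to three per-pixel probability maps. The layer sizes, channel counts, and upsampling choice are assumptions made for illustration and are not drawn from a particular implementation.

```python
import torch
import torch.nn as nn

class HeatMapHead(nn.Module):
    """Toy decoder head producing three per-pixel maps (subject, interferent,
    background) whose values lie in [0, 1] via a softmax across the three maps."""

    def __init__(self, in_channels=2048, num_maps=3):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_maps, kernel_size=1),
        )

    def forward(self, features, out_size):
        logits = self.decode(features)
        logits = nn.functional.interpolate(logits, size=out_size,
                                           mode="bilinear", align_corners=False)
        # Channel 0: subject, channel 1: interferent, channel 2: background.
        return torch.softmax(logits, dim=1)

# Usage with a hypothetical residual-network encoder producing 2048-channel features.
features = torch.randn(1, 2048, 16, 16)
heat_maps = HeatMapHead()(features, out_size=(512, 512))   # shape: (1, 3, 512, 512)
```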
In one or more embodiments, the interferent detection neural network 2700 utilizes the heat map 2708 output by the heat map network 2702 as a heat map prior for the interferent classifier 2704, indicating the probability that a particular region of the input image 2706 contains an interfering object or a subject object.
In one or more embodiments, the interferent detection neural network 2700 utilizes the interferent classifier 2704 to consider global information included in the heat map 2708 and local information included in the one or more individual objects 2710. To illustrate, in some embodiments, interferent classifier 2704 generates a score for the classification of an object. If an object in the digital image appears to be a subject object based on the local information, but the heat map 2708 indicates with high probability that the object is an interfering object, the interferent classifier 2704 concludes in some cases that the object is indeed an interfering object. On the other hand, if the heat map 2708 also indicates that the object is a subject object, the interferent classifier 2704 determines that the object is confirmed as a subject object.
As shown in fig. 27, interferent classifier 2704 includes crop generator 2712 and hybrid classifier 2714. In one or more embodiments, interferent classifier 2704 receives one or more individual objects 2710 that have been identified from input image 2706. In some cases, one or more individual objects 2710 are identified via user annotations or some object detection network (e.g., object detection machine learning model 308 discussed above with reference to fig. 3).
As shown in fig. 27, interferent classifier 2704 utilizes crop generator 2712 to generate cropped images 2716 by cropping input image 2706 based on the positions of the one or more individual objects 2710. For example, in the case where there are three object detections in the input image 2706, the crop generator 2712 generates three cropped images—one cropped image for each detected object. In one or more embodiments, crop generator 2712 generates a cropped image by removing all pixels of input image 2706 outside the location of the respective inferred boundary region.
As further shown, interferent classifier 2704 also utilizes crop generator 2712 to generate cropped heat maps 2718 by cropping heat map 2708 with respect to each detected object. For example, in one or more embodiments, crop generator 2712 generates a cropped heat map for each detected object based on the area within each of the subject heat map, the interferent heat map, and the background heat map that corresponds to the location of the detected object.
In one or more embodiments, for each individual object of the one or more individual objects 2710, the interferent classifier 2704 utilizes the hybrid classifier 2714 to operate on the corresponding cropped image (e.g., its features) and the corresponding cropped heat map (e.g., its features) to determine whether the object is a subject object or an interfering object. To illustrate, in some embodiments, for a detected object, hybrid classifier 2714 performs operations on the cropped image associated with the detected object and the cropped heat map associated with the detected object (e.g., a cropped heat map derived from heat map 2708 based on the position of the detected object) to determine whether the detected object is a subject object or an interfering object. In one or more embodiments, the interferent classifier 2704 combines features of the cropped image of the detected object with features of the corresponding cropped heat map (e.g., via concatenation or feature addition) and provides the combination to the hybrid classifier 2714. As shown in fig. 27, the hybrid classifier 2714 generates a binary decision 2720 from the corresponding cropped image and cropped heat map, which includes a label for the detected object as a subject object or an interfering object.
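The cropping step performed by the crop generator can be sketched as follows; the detection format (a dictionary with a bounding box) and array layout are assumptions made for illustration.

```python
import numpy as np

def crop_to_box(array, box):
    """Crop an (H, W, C) array to an (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = box
    return array[y0:y1, x0:x1]

def build_classifier_inputs(image, heat_maps, detections):
    """For every detected object, pair a cropped image with the correspondingly
    cropped subject/interferent/background heat maps (stacked as channels)."""
    inputs = []
    for detection in detections:
        cropped_image = crop_to_box(image, detection["box"])          # (h, w, 3)
        cropped_heat_maps = crop_to_box(heat_maps, detection["box"])  # (h, w, 3)
        inputs.append({"image": cropped_image, "heat_maps": cropped_heat_maps})
    return inputs
```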
Fig. 28 illustrates an architecture of a heat map network 2800 that is used as part of an interferent detection neural network by the scene-based image editing system 106 in accordance with one or more embodiments. As shown in fig. 28, the heat map network 2800 includes a convolutional neural network 2802 as its encoder. In one or more embodiments, convolutional neural network 2802 includes a deep residual network. As further shown in fig. 28, the heat map network 2800 includes a heat map head 2804 as its decoder.
FIG. 29 illustrates an architecture of a hybrid classifier 2900 that is used as part of an interferent detection neural network by a scene-based image editing system 106 in accordance with one or more embodiments. As shown in fig. 29, hybrid classifier 2900 includes a convolutional neural network 2902. In one or more embodiments, hybrid classifier 2900 uses convolutional neural network 2902 as an encoder.
To illustrate, in one or more embodiments, the scene-based image editing system 106 provides features of the cropped image 2904 to the convolutional neural network 2902. In addition, the scene-based image editing system 106 provides features of the cropped heat map 2906 corresponding to the object of the cropped image 2904 to an inner layer 2910 of the hybrid classifier 2900. Specifically, as shown, in some cases, the scene-based image editing system 106 concatenates the features of the cropped heat map 2906 with the output of the previous inner layer (via concatenation operation 2908) and provides the resulting feature map to the inner layer 2910 of the hybrid classifier 2900. In some embodiments, the feature map includes 2048+N channels, where N corresponds to the number of channels output by the heat map network and 2048 corresponds to the number of channels output by the previous inner layer (although 2048 is merely one example).
As shown in fig. 29, the hybrid classifier 2900 convolves the output of the inner layer 2910 to reduce the channel depth. In addition, hybrid classifier 2900 performs another convolution on the output of the subsequent inner layer 2914 to further reduce the channel depth. In some cases, hybrid classifier 2900 applies pooling to the output of the final inner layer 2916 preceding the binary classification head 2912. For example, in some cases, hybrid classifier 2900 averages the values of the final inner layer output to generate an average value. In some cases, where the average value is above a threshold, hybrid classifier 2900 classifies the corresponding object as an interfering object and outputs a corresponding binary value; otherwise, hybrid classifier 2900 classifies the corresponding object as a subject object and outputs the corresponding binary value (or vice versa). Thus, hybrid classifier 2900 provides an output 2918 containing a label for the corresponding object.
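A minimal PyTorch sketch of the hybrid classifier head described above is shown below: features of the cropped image (assumed here to have 2048 channels) are concatenated with the cropped heat map channels, reduced with convolutions, average-pooled, and thresholded into a binary subject/interferent decision. All sizes, the threshold, and the label convention are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridClassifierHead(nn.Module):
    """Toy head: concatenate image features (2048 channels) with N heat map
    channels, reduce channel depth, pool spatially, and threshold the score."""

    def __init__(self, image_channels=2048, heat_map_channels=3):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(image_channels + heat_map_channels, 512, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 1, kernel_size=1),
        )

    def forward(self, image_features, heat_map_features, threshold=0.5):
        combined = torch.cat([image_features, heat_map_features], dim=1)  # 2048 + N channels
        score_map = torch.sigmoid(self.reduce(combined))
        score = score_map.mean(dim=(2, 3))       # average pooling over spatial dimensions
        return (score > threshold).long()        # 1 = interfering object, 0 = subject object

# Hypothetical feature maps for one cropped object.
image_features = torch.randn(1, 2048, 7, 7)
heat_map_features = torch.randn(1, 3, 7, 7)
label = HybridClassifierHead()(image_features, heat_map_features)
```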
FIGS. 30A-30C illustrate a graphical user interface implemented by the scene-based image editing system 106 to identify and remove interfering objects from digital images in accordance with one or more embodiments. For example, as shown in fig. 30A, the scene-based image editing system 106 provides a digital image 3006 for display within a graphical user interface 3002 of the client device 3004. As further shown, the digital image 3006 depicts an object 3008 and a plurality of additional objects 3010a-3010d.
In addition, as shown in FIG. 30A, the scene based image editing system 106 provides a progress indicator 3012 for display within the graphical user interface 3002. In some cases, scene based image editing system 106 provides progress indicator 3012 to indicate that digital image 3006 is being analyzed for interfering objects. For example, in some embodiments, the scene-based image editing system 106 provides a progress indicator 3012 while utilizing an interferent detection neural network to identify and classify interfering objects within the digital image 3006. In one or more embodiments, the scene-based image editing system 106 automatically implements the jammer detection neural network upon receiving the digital image 3006 and prior to receiving user input for modifying one or more of the objects 3010a-3010 d. However, in some implementations, the scene-based image editing system 106 waits for user input to be received and then analyzes the digital image 3006 for interfering objects.
As shown in FIG. 30B, scene based image editing system 106 provides visual indicators 3014a-3014d for display within graphical user interface 3002 upon completion of an analysis. In particular, scene based image editing system 106 provides visual indicators 3014a-3014d to indicate that objects 3010a-3010d have been classified as interfering objects.
In one or more embodiments, scene-based image editing system 106 also provides visual indicators 3014a-3014d to indicate that objects 3010a-3010d have been selected for deletion. In some cases, scene-based image editing system 106 also presents pre-generated object masks for objects 3010a-3010d in preparation for deleting the objects. In fact, as already discussed, the scene-based image editing system 106 pre-generates object masking and content filling (e.g., using the above-referenced segmentation neural network 2604 and repair neural network 2610) for objects of the digital image. Thus, scene-based image editing system 106 has object masking and content filling that are readily used to modify objects 3010a-3010 d.
In one or more embodiments, the scene-based image editing system 106 enables user interactions to be added to or removed from the selection of objects to be deleted. For example, in some embodiments, upon detecting a user interaction with object 3010a, scene-based image editing system 106 determines to omit object 3010a from the delete operation. In addition, scene-based image editing system 106 removes visual indication 3014a from the display of graphical user interface 3002. On the other hand, in some implementations, the scene-based image editing system 106 detects user interactions with the object 3008 and, in response, determines to include the object 3008 in the delete operation. Furthermore, in some cases, scene-based image editing system 106 provides visual indications for object 3008 for display and/or presentation of pre-generated object masks for object 3008 in preparation for deletion.
As further shown in fig. 30B, scene-based image editing system 106 provides remove option 3016 for display within graphical user interface 3002. In one or more embodiments, in response to detecting user interaction with the remove option 3016, the scene-based image editing system 106 removes objects that have been selected for deletion (e.g., objects 3010a-3010d that have been classified as interfering objects). In effect, as shown in FIG. 30C, scene-based image editing system 106 removes objects 3010a-3010d from digital image 3006. Further, as shown in FIG. 30C, upon removal of the objects 3010a-3010d, the scene-based image editing system 106 exposes previously generated content fills 3018a-3018d.
By enabling user interaction to control which objects are included in the delete operation and further selecting when to remove selected objects, scene-based image editing system 106 provides greater flexibility. Indeed, while conventional systems typically automatically delete interfering objects without user input, the scene-based image editing system 106 allows the interfering objects to be deleted according to user preferences expressed via user interaction. Thus, the scene-based image editing system 106 flexibly allows for control of the removal process via user interaction.
In various embodiments, in addition to removing interfering objects identified via the interferer detection neural network, the scene-based image editing system 106 provides other features for removing unwanted portions of the digital image. For example, in some cases, scene-based image editing system 106 provides a tool whereby user interactions can be deleted for any portion of the digital image. 31A-31C illustrate a graphical user interface implemented by the scene-based image editing system 106 to identify and remove interfering objects from digital images in accordance with one or more embodiments.
In particular, fig. 31A shows a digital image 3106 displayed on a graphical user interface 3102 of a client device 3104. The digital image 3106 corresponds to the digital image 3006 of fig. 30C after the interfering objects identified by the interferent detection neural network have been removed. Thus, in some cases, the objects remaining in the digital image 3106 represent those objects that were not identified as interfering objects and therefore were not removed. For example, in some cases, the set of objects 3110 near the horizon of the digital image 3106 includes objects that are not identified as interfering objects by the interferent detection neural network.
As further shown in fig. 31A, scene-based image editing system 106 provides a brush tool option 3108 for display within graphical user interface 3102. Fig. 31B illustrates that upon detecting a user interaction with the brush tool option 3108, the scene-based image editing system 106 enables one or more user interactions to use the brush tool to select any portion of the digital image 3106 (e.g., a portion not identified by the interferent detection neural network) for removal. For example, as shown, the scene-based image editing system 106 receives one or more user interactions with the graphical user interface 3102 that target a portion of the digital image 3106 depicting the set of objects 3110.
As shown in fig. 31B, with the brush tool, scene-based image editing system 106 allows free-form user input in some cases. In particular, fig. 31B illustrates that scene-based image editing system 106 provides visual indication 3112 representing a portion (e.g., a target specific pixel) of digital image 3106 selected via a brush tool. In fact, rather than receiving user interactions with previously identified objects or other pre-segmented semantic regions, the scene-based image editing system 106 uses a brush tool to enable arbitrary selection of portions of the digital image 3106. Thus, scene-based image editing system 106 utilizes a brush tool to provide additional flexibility whereby user interaction can specify undesirable areas of a digital image that may not be identified by machine learning.
As further shown in fig. 31B, scene-based image editing system 106 provides remove option 3114 for display within graphical user interface 3102. As shown in fig. 31C, in response to detecting a user interaction with the remove option 3114, the scene-based image editing system 106 removes the selected portion of the digital image 3106. Further, as shown, scene-based image editing system 106 fills the selected portion with content fill 3116. In one or more embodiments, where the portion removed from digital image 3106 does not include an object for which a content fill was previously generated (or otherwise includes additional pixels not covered by a previously generated content fill), scene-based image editing system 106 generates content fill 3116 after removing the portion of digital image 3106 selected via the brush tool. Specifically, after removing the selected portion, scene-based image editing system 106 generates content fill 3116 using the content-aware hole-fill machine learning model.
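One way the free-form brush selection could be rasterized into a removal mask and handed to a content-aware fill model is sketched below; the stroke representation (lists of pixel coordinates), the brush radius, and the fill interface are assumptions made for illustration.

```python
import numpy as np

def mask_from_brush_strokes(image_shape, strokes, radius=8):
    """Rasterize free-form brush strokes (lists of (x, y) points) into a binary mask."""
    height, width = image_shape[:2]
    mask = np.zeros((height, width), dtype=np.uint8)
    yy, xx = np.mgrid[0:height, 0:width]
    for stroke in strokes:
        for x, y in stroke:
            mask[(xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2] = 1
    return mask

def remove_brushed_region(image, strokes, hole_fill_model):
    """Delete the brushed pixels and fill the resulting hole with a content-aware model."""
    removal_mask = mask_from_brush_strokes(image.shape, strokes)
    return hole_fill_model.fill(image, removal_mask)
```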
In one or more embodiments, the scene-based image editing system 106 also implements intelligent dilation (smart dilation) when removing objects from a digital image. For example, in some cases, the scene-based image editing system 106 utilizes intelligent dilation to remove objects that touch, overlap, or otherwise come near other objects depicted in the digital image. FIG. 32A illustrates the scene-based image editing system 106 utilizing intelligent dilation to remove an object from a digital image in accordance with one or more embodiments.
In general, conventional systems utilize a tight mask (e.g., a mask that adheres tightly to the boundary of the corresponding object) to remove an object from a digital image. In many cases, however, the digital image includes color bleeding or artifacts around the boundary of the object. For example, some image formats (e.g., JPEG) are particularly susceptible to format-dependent artifacts around object boundaries. When these problems occur, the use of a tight mask can produce adverse effects in the generated image. For example, repair models are often sensitive to these image imperfections and can produce significant artifacts when operating directly on the segmentation output. Thus, rather than accurately capturing the user's intent, the resulting modified image introduces additional image noise in the course of removing the object.
Thus, when an object is removed, the scene-based image editing system 106 expands (e.g., expands) the object mask of the object to avoid associated artifacts. However, dilated object masking risks removing portions of other objects depicted in the digital image. For example, in the event that a first object to be removed overlaps, contacts, or is in proximity to a second object, the expanding mask of the first object will typically extend into the space occupied by the second object. Thus, when the first object is removed using the dilated object mask, a larger portion of the second object is typically removed and the resulting hole is (typically incorrectly) filled, causing an undesirable effect in the resulting image. Thus, the scene-based image editing system 106 utilizes intelligent dilation to avoid significant expansion of object masking of objects to be removed into the region of the digital image occupied by other objects.
As shown in fig. 32A, the scene-based image editing system 106 determines to remove the object 3202 depicted in the digital image 3204. For example, in some cases, the scene-based image editing system 106 determines (e.g., via an interferent detection neural network) that the object 3202 is an interfering object. In some implementations, the scene-based image editing system 106 receives a user selection of an object 3202 to remove. Digital image 3204 also depicts objects 3206a-3206b. As shown, the object 3202 selected for removal overlaps with the object 3206b in the digital image 3204.
As further shown in fig. 32A, scene-based image editing system 106 generates object masks 3208 for objects 3202 to be removed and combined object masks 3210 for objects 3206a-3206b. For example, in some embodiments, scene-based image editing system 106 generates object mask 3208 and combined object mask 3210 from digital image 3204 using a segmented neural network. In one or more embodiments, scene-based image editing system 106 generates combined object masks 3210 by generating object masks for each of objects 3206a-3206b and determining a union between the individual object masks.
Further, as shown in fig. 32A, scene-based image editing system 106 performs an act 3212 of expanding object mask 3208 for object 3202 to be removed. In particular, scene-based image editing system 106 expands the representation of object 3202 within object mask 3208. In other words, scene-based image editing system 106 adds pixels to the boundary of the representation of the object within object mask 3208. The amount of expansion varies in various embodiments and is configurable in some implementations to accommodate user preferences. For example, in one or more implementations, scene-based image editing system 106 expands the object mask outward by 10, 15, 20, 25, or 30 pixels.
After extending object mask 3208, scene-based image editing system 106 performs an act 3214 of detecting an overlap between the extended object mask of object 3202 and the object masks of other detected objects 3206a-3206b (i.e., combined object mask 3210). In particular, scene-based image editing system 106 determines where pixels corresponding to the extended representation of object 3202 within the extended object mask overlap with pixels corresponding to objects 3206a-3206b within combined object mask 3210. In some cases, scene-based image editing system 106 determines a union between extended object mask and combined object mask 3210 and uses the resulting union to determine the overlap. Scene-based image editing system 106 also performs an act 3216 of removing the overlapping portion from the extended object mask of object 3202. In other words, scene-based image editing system 106 removes pixels from the representation of object 3202 within the extended object mask that overlap pixels within combined object mask 3210 that correspond to object 3206a and/or object 3206 b.
Thus, as shown in fig. 32A, the scene-based image editing system 106 generates a smart augmented object mask 3218 (e.g., an extended object mask) for the object 3202 to be removed. In particular, scene-based image editing system 106 generates intelligently-expanded object masks 3218 by expanding object masks 3208 in areas that do not overlap any of objects 3206a-3206b and avoiding expansion in areas that overlap at least one of objects 3206a-3206 b. In at least some implementations, the scene-based image editing system 106 reduces expansion in overlapping areas. For example, in some cases, the intelligently-expanded object mask 3218 still includes an expansion in the overlapping region, but the expansion is significantly smaller when compared to the region where no overlap exists. In other words, scene-based image editing system 106 expands using fewer pixels in areas where there is overlap. For example, in one or more implementations, the scene-based image editing system 106 expands or dilates the object mask by a factor of 5, 10, 15, or 20 in areas where no overlap exists as compared to areas where overlap exists.
In other words, in one or more embodiments, scene-based image editing system 106 generates intelligently-expanded object masks 3218 (e.g., expanded object masks) by expanding object masks 3208 of objects 3202 to object mask unoccupied areas of objects 3206a-3206b (e.g., areas of objects 3206a-3206b themselves are unoccupied). For example, in some cases, the scene-based image editing system 106 expands the object mask 3208 into portions of the digital image 3204 that adjoin the object mask 3208. In some cases, scene-based image editing system 106 expands object mask 3208 to an adjoining portion of a set number of pixels. In some implementations, the scene-based image editing system 106 utilizes a different number of pixels to extend the object mask 3208 into different contiguous portions (e.g., based on detecting overlapping regions between the object mask 3208 and other object masks).
To illustrate, in one or more embodiments, the scene-based image editing system 106 expands the object mask 3208 into the foreground and background of the digital image 3204. In particular, the scene-based image editing system 106 determines the foreground by combining object masking of objects that are not deleted. The scene-based image editing system 106 expands the object mask 3208 to contiguous foreground and background. In some implementations, the scene-based image editing system 106 expands the object mask 3208 to a first amount in the foreground and expands the object mask 3208 to a second amount in the background that is different from the first amount (e.g., the second amount is greater than the first amount). For example, in one or more implementations, scene-based image editing system 106 expands object masks 20 pixels into the background region and expands two pixels into the foreground region (to contiguous object masks, such as combined object mask 3210).
In one or more embodiments, the scene-based image editing system 106 determines the first amount for expanding the object mask 3208 into the foreground by initially expanding the object mask 3208 by the second amount—the same amount used to expand the object mask 3208 into the background. In other words, the scene-based image editing system 106 initially expands the object mask 3208 as a whole by the same amount into the foreground and the background (e.g., using the same number of pixels). The scene-based image editing system 106 also determines an overlap region between the expanded object mask and the object masks (e.g., the combined object mask 3210) corresponding to the other objects 3206a-3206b. In one or more embodiments, the overlapping region exists in the foreground of the digital image 3204 adjacent to the object mask 3208. Thus, the scene-based image editing system 106 reduces the expansion of the object mask 3208 into the foreground such that the expansion corresponds to the smaller first amount. Indeed, in some cases, scene-based image editing system 106 removes the overlapping region (e.g., removes overlapping pixels) from the expanded object mask of object 3202. In some cases, scene-based image editing system 106 removes a portion of the overlapping region, but not the entire overlapping region, resulting in a reduced overlap between the expanded object mask of object 3202 and the object masks corresponding to objects 3206a-3206b.
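The intelligent dilation described above can be approximated with standard morphological operations, as in the following sketch: the mask is dilated by a large amount, but the expansion is clipped back to a small amount wherever it would intrude on the masks of objects that remain in the image. The pixel amounts and the use of scipy's default structuring element are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def smart_dilate(object_mask, other_object_masks, background_pixels=20, foreground_pixels=2):
    """Expand a removal mask aggressively into the background but only slightly
    into the foreground (regions occupied by objects that should stay)."""
    object_mask = object_mask.astype(bool)
    foreground = np.clip(np.sum(other_object_masks, axis=0), 0, 1).astype(bool)

    # Dilate uniformly by the larger (background) amount and by the smaller amount.
    wide = binary_dilation(object_mask, iterations=background_pixels)
    narrow = binary_dilation(object_mask, iterations=foreground_pixels)

    # Keep the wide expansion outside other objects and the narrow expansion inside them.
    expanded = (wide & ~foreground) | (narrow & foreground) | object_mask
    return expanded.astype(np.uint8)
```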
In one or more embodiments, because removing object 3202 includes removing foreground and background adjacent to intelligently-expanded object mask 3218 (e.g., expanded object mask) generated for object 3202, scene-based image editing system 106 repairs remaining holes after removal. Specifically, scene-based image editing system 106 repairs the hole with foreground pixels and background pixels. Indeed, in one or more embodiments, scene-based image editing system 106 utilizes a repair neural network to generate foreground pixels and background pixels for the generated hole and utilizes the generated pixels to repair the hole, thereby generating a modified digital image (e.g., a repair digital image) in which object 3202 has been removed and the corresponding portion of digital image 3204 has been filled.
For example, FIG. 32B illustrates the advantages provided by intelligently expanding an object mask before performing repair. In particular, fig. 32B illustrates that when the intelligently expanded object mask 3218 is provided to a repair neural network (e.g., cascade modulation repair neural network 420) as the area to be filled, the repair neural network generates a modified digital image 3220 in which the area corresponding to the intelligently expanded object mask 3218 is filled with pixels generated by the repair neural network. As shown, the modified digital image 3220 does not include artifacts in the repaired area corresponding to the intelligently expanded object mask 3218. In effect, the modified digital image 3220 provides a realistic-looking image.
In contrast, fig. 32B illustrates that when the object mask 3208 (e.g., the unexpanded object mask) is provided to the repair neural network (e.g., cascade modulation repair neural network 420) as the area to be filled, the repair neural network generates a modified digital image 3222 in which the area corresponding to the object mask 3208 is filled with pixels generated by the repair neural network. As shown, the modified digital image 3222 includes artifacts in the repaired area corresponding to the object mask 3208. Specifically, artifacts appear along the girl's back and in the generated water.
By generating intelligently expanded object masks, the scene-based image editing system 106 provides improved image results when objects are removed. In effect, the scene-based image editing system 106 utilizes expansion to remove artifacts, color loss, or other undesirable errors in the digital image, while avoiding the removal of significant portions of other objects that remain in the digital image. Thus, the scene-based image editing system 106 is able to fill in the hole left by the removed object without amplifying existing errors and, where possible, without unnecessarily replacing portions of the other remaining objects.
As previously described, in one or more embodiments, scene-based image editing system 106 also utilizes a shadow detection neural network to detect shadows associated with interfering objects depicted within the digital image. Fig. 33-38 illustrate diagrams of a shadow detection neural network used by a scene-based image editing system 106 to detect shadows associated with an object in accordance with one or more embodiments.
In particular, FIG. 33 illustrates an overview of a shadow detection neural network 3300 in accordance with one or more embodiments. In practice, as shown in fig. 33, the shadow detection neural network 3300 analyzes the input image 3302 through a first stage 3304 and a second stage 3310. Specifically, first stage 3304 includes an instance segmentation component 3306 and an object perception component 3308. In addition, the second stage 3310 includes a shadow prediction component 3312. In one or more embodiments, the example segmentation component 3306 includes the segmented neural network 2604 of the neural network pipeline discussed above with reference to fig. 26.
As shown in FIG. 33, after analyzing the input image 3302, the shadow detection neural network 3300 identifies the objects 3314a-3314c and shadows 3316a-3316c depicted therein. In addition, shadow detection neural network 3300 associates objects 3314a-3314c with their respective shadows. For example, shadow detection neural network 3300 associates object 3314a with shadow 3316a, and similarly associates other objects with shadows. Thus, when its associated object is selected for deletion, movement, or some other modification, shadow detection neural network 3300 facilitates the inclusion of shadows.
FIG. 34 illustrates an overview of an example segmentation component 3400 of a shadow detection neural network in accordance with one or more embodiments. As shown in fig. 34, an instance segmentation component 3400 implements an instance segmentation model 3402. As shown in FIG. 34, the instance segmentation component 3400 utilizes the instance segmentation model 3402 to analyze the input image 3404 and identify objects 3406a-3406c depicted therein based upon the analysis. For example, in some cases, the scene-based image editing system 106 outputs object masks and/or bounding boxes for the objects 3406a-3406c.
FIG. 35 illustrates an overview of an object perception component 3500 of a shadow detection neural network in accordance with one or more embodiments. In particular, FIG. 35 shows input image instances 3502a-3502c corresponding to each object detected within the digital image via the previous instance segmentation component. In particular, each input image instance corresponds to a different detected object and corresponds to an object mask and/or bounding box generated for the digital image. For example, input image instance 3502a corresponds to object 3504a, input image instance 3502b corresponds to object 3504b, and input image instance 3502c corresponds to object 3504c. Thus, input image instances 3502a-3502c illustrate separate object detection provided by an instance segmentation component of a shadow detection neural network.
In some embodiments, for each detected object, scene-based image editing system 106 generates input for the second stage of the shadow detection neural network (i.e., the shadow prediction component). Fig. 35 illustrates an object perception component 3500 that generates input 3506 for an object 3504 a. In effect, as shown in FIG. 35, object perception component 3500 generates input 3506 using input image 3508, object masks 3510 (referred to as object perception channels) corresponding to objects 3504a, and combined object masks 3512 (referred to as object discrimination channels) corresponding to objects 3504b-3504 c. For example, in some implementations, object perception component 3500 combines (e.g., connects) input image 3508, object mask 3510, and combined object mask 3512. Object awareness component 3500 similarly generates a second level of input for other objects 3504b-3504c as well (e.g., utilizing their respective object masks and a combined object mask representing the other objects along with input image 3508).
In one or more embodiments, scene-based image editing system 106 generates combined object mask 3512 (e.g., via object perception component 3500 or some other component of the shadow detection neural network) using a union of the individual object masks generated for object 3504b and object 3504c. In some cases, object perception component 3500 does not utilize the object discrimination channel (e.g., combined object mask 3512). Rather, object perception component 3500 uses input image 3508 and object mask 3510 to generate input 3506. However, in some embodiments, using the object discrimination channel provides better shadow prediction in the second stage of the shadow detection neural network.
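The construction of the second-stage input described above (the input image plus an object-aware channel and an object-discrimination channel) can be sketched as follows; the channel ordering and array layout are assumptions made for illustration.

```python
import numpy as np

def build_shadow_prediction_input(image, object_mask, other_object_masks):
    """Stack the input image with an object-aware channel (the object of interest)
    and an object-discrimination channel (union of every other detected object)."""
    object_aware = object_mask.astype(np.float32)[..., None]              # (H, W, 1)
    if other_object_masks:
        union = np.clip(np.sum(other_object_masks, axis=0), 0, 1)
    else:
        union = np.zeros_like(object_mask)
    object_discrimination = union.astype(np.float32)[..., None]           # (H, W, 1)
    return np.concatenate([image.astype(np.float32), object_aware,
                           object_discrimination], axis=-1)               # (H, W, 5) for RGB input
```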
FIG. 36 illustrates an overview of a shadow prediction component 3600 of a shadow detection neural network in accordance with one or more embodiments. As shown in fig. 36, the shadow detection neural network provides inputs to a shadow prediction component 3600 that are compiled by an object-aware component that includes an input image 3602, an object mask 3604 for objects of interest, and a combined object mask 3606 for other detected objects. Shadow prediction component 3600 utilizes shadow segmentation model 3608 to generate a first shadow prediction 3610 for an object of interest and a second shadow prediction 3612 for other detected objects. In one or more embodiments, first shadow prediction 3610 and/or second shadow prediction 3612 includes shadow masking for a corresponding shadow (e.g., where shadow masking includes object masking for a shadow). In other words, shadow prediction component 3600 utilizes shadow segmentation model 3608 to generate first shadow prediction 3610 by generating shadow masks for predicted shadows of an object of interest. Similarly, shadow prediction component 3600 utilizes shadow segmentation model 3608 to generate second shadow prediction 3612 by generating a combined shadow mask for shadows predicted for other detected objects.
Based on the output of the shadow segmentation model 3608, the shadow prediction component 3600 provides object-shadow pair prediction 3614 for an object of interest. In other words, shadow prediction component 3600 associates an object of interest with its shadow cast within input image 3602. In one or more embodiments, shadow prediction component 3600 similarly generates object-shadow pair predictions for all other objects depicted in input image 3602. Accordingly, shadow prediction component 3600 identifies shadows depicted in a digital image and associates each shadow with its corresponding object.
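Reusing the input-construction helper sketched above, shadow prediction and object-shadow pairing could be orchestrated roughly as follows. The shadow model's predict interface (returning a shadow mask for the object of interest and a combined shadow mask for the remaining objects) is an assumption made for illustration.

```python
def pair_objects_with_shadows(image, object_masks, shadow_model):
    """Run shadow prediction once per detected object and collect object-shadow pairs."""
    pairs = []
    for index, object_mask in enumerate(object_masks):
        other_masks = [mask for j, mask in enumerate(object_masks) if j != index]
        model_input = build_shadow_prediction_input(image, object_mask, other_masks)
        shadow_mask, _other_shadows = shadow_model.predict(model_input)
        pairs.append({"object_mask": object_mask, "shadow_mask": shadow_mask})
    return pairs
```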
In one or more embodiments, the shadow segmentation model 3608 used by the shadow prediction component 3600 includes a segmentation neural network. For example, in some cases, the shadow segmentation model 3608 includes the detection masking neural network discussed above with reference to fig. 3. As another example, the shadow segmentation model 3608 includes the DeepLabv3 semantic segmentation model described by Liang-Chieh Chen et al., Rethinking Atrous Convolution for Semantic Image Segmentation, arXiv:1706.05587, 2017, or the DeepLab semantic segmentation model described by Liang-Chieh Chen et al., DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, arXiv:1606.00915, 2016, the entire contents of both of which are incorporated herein by reference.
Fig. 37 illustrates an overview of an architecture of a shadow detection neural network 3700 in accordance with one or more embodiments. In particular, FIG. 37 illustrates a shadow detection neural network 3700 that includes the example segmentation component 3400 discussed with reference to FIG. 34, the object perception component 3500 discussed with reference to FIG. 35, and the shadow prediction component 3600 discussed with reference to FIG. 36. Further, fig. 37 shows that the shadow detection neural network 3700 generates object masking, shadow masking, and prediction with respect to each object depicted in the input image 3702. Thus, shadow detection neural network 3700 outputs a final prediction 3704 that associates each object depicted in the digital image with its shadow. Thus, as shown in fig. 37, the shadow detection neural network 3700 provides an end-to-end neural network framework that receives the digital image and outputs an association between the object and the shadow described therein.
In some implementations, the shadow detection neural network 3700 determines that an object depicted within the digital image does not have an associated shadow. Indeed, in some cases, the shadow detection neural network 3700 determines that shadows associated with an object are not depicted within the digital image when the digital image is analyzed with its various components. In some cases, scene-based image editing system 106 provides feedback indicating the lack of shadows. For example, in some cases, upon determining that no shadow is depicted within the digital image (or that no shadow associated with a particular object is present), the scene-based image editing system 106 provides a message for display or other feedback indicating the lack of shadow. In some cases, scene-based image editing system 106 does not provide explicit feedback, but does not automatically select or provide suggestions for including shadows within the object selection, as discussed below with reference to fig. 39A-39C.
In some implementations, when an object mask for an object has already been generated, the scene-based image editing system 106 utilizes the second stage of the shadow detection neural network to determine shadows associated with the object depicted in the digital image. In effect, FIG. 38 illustrates a diagram for determining shadows associated with an object depicted in a digital image using the second stage of a shadow detection neural network in accordance with one or more embodiments.
As shown in fig. 38, the scene-based image editing system 106 provides the input image 3804 to the second stage of the shadow detection neural network (i.e., the shadow prediction model 3802). In addition, scene-based image editing system 106 provides object mask 3806 to the second stage. The scene-based image editing system 106 utilizes the second stage of the shadow detection neural network to generate a shadow mask 3808 for the shadow of the object depicted in the input image 3804, resulting in an association between the object and the shadow it casts within the input image 3804 (e.g., as shown in the visualization 3810).
By providing direct access to the second stage of the shadow detection neural network, the scene-based image editing system 106 provides flexibility in shadow detection. Indeed, in some cases, object masks have already been created for objects depicted in digital images. For example, in some cases, the scene-based image editing system 106 implements a separate segmentation neural network to generate object masks for digital images as part of a separate workflow. Thus, an object mask for the object already exists, and the scene-based image editing system 106 leverages that previous work to determine the shadow of the object. The scene-based image editing system 106 thus also provides efficiency, as it avoids duplicating work by directly accessing the shadow prediction model of the shadow detection neural network.
FIGS. 39A-39C illustrate a graphical user interface implemented by the scene-based image editing system 106 to identify and remove shadows of objects depicted in a digital image, in accordance with one or more embodiments. In effect, as shown in fig. 39A, the scene-based image editing system 106 provides a digital image 3906 depicting an object 3908 for display within a graphical user interface 3902 of a client device. As further shown, object 3908 casts shadow 3910 within digital image 3906.
In one or more embodiments, upon receiving digital image 3906, scene-based image editing system 106 utilizes a shadow-detection neural network to analyze digital image 3906. In particular, scene-based image editing system 106 utilizes a shadow-detection neural network to identify object 3908, identify shadows 3910 cast by object 3908, and further associate shadows 3910 with object 3908. As previously described, in some implementations, scene-based image editing system 106 also utilizes a shadow detection neural network to generate object masks for object 3908 and shadow 3910.
As previously discussed with reference to fig. 26, in one or more embodiments, the scene-based image editing system 106 identifies shadows cast by objects within the digital image as part of a neural network pipeline for identifying interfering objects within the digital image. For example, in some cases, the scene-based image editing system 106 uses a segmentation neural network to identify objects of the digital image, uses an interferent detection neural network to classify one or more objects as interfering objects, uses a shadow detection neural network to identify shadows and associate shadows with their corresponding objects, and uses a repair neural network to generate content fills to replace the removed objects (and shadows thereof). In some cases, the scene-based image editing system 106 automatically implements neural network pipelining in response to receiving the digital image.
In effect, as shown in fig. 39B, the scene-based image editing system 106 provides a visual indication 3912 for display within the graphical user interface 3902, the visual indication 3912 indicating a selection of an object 3908 to remove. As further shown, scene-based image editing system 106 provides visual indication 3914 for display, visual indication 3914 indicating selection of shadows 3910 to remove. As suggested, in some cases, scene-based image editing system 106 automatically (e.g., upon determining that object 3908 is an interfering object) selects object 3908 and shadow 3910 for deletion. However, in some implementations, scene-based image editing system 106 selects object 3908 and/or shadow 3910 in response to receiving one or more user interactions.
For example, in some cases, scene-based image editing system 106 receives a user's selection of object 3908 and automatically adds shadow 3910 to the selection. In some implementations, the scene-based image editing system 106 receives a user selection of an object 3908 and provides suggestions for display in the graphical user interface 3902 that suggest shadows 3910 to add to the selection. In response to receiving the additional user interaction, scene-based image editing system 106 adds shadow 3910.
As further shown in fig. 39B, the scene-based image editing system 106 provides a remove option 3916 for display within the graphical user interface 3902. As shown in fig. 39C, upon receiving a selection of remove option 3916, scene-based image editing system 106 removes object 3908 and shadow 3910 from the digital image. As further shown, scene-based image editing system 106 replaces object 3908 with content fill 3918 and replaces shadow 3910 with content fill 3920. In other words, scene-based image editing system 106 exposes content fill 3918 and content fill 3920, respectively, when object 3908 and shadow 3910 are removed.
While FIGS. 39A-39C illustrate implementation of shadow detection for delete operations, it should be noted that in various embodiments, scene-based image editing system 106 implements shadow detection for other operations (e.g., move operations). Further, while FIGS. 39A-39C are discussed with respect to removing interfering objects from digital images, the scene-based image editing system 106 enables shadow detection in the context of other features described herein. For example, in some cases, scene-based image editing system 106 enables shadow detection for object-aware modifications in which user interactions directly target objects. Thus, the scene-based image editing system 106 provides further advantages for object-aware modification by segmenting objects and their shadows and generating corresponding content fills prior to receiving user interactions, thereby allowing seamless interaction with digital images.
By identifying shadows cast by objects within the digital image, the scene-based image editing system 106 provides improved flexibility over conventional systems. In practice, the scene-based image editing system 106 flexibly identifies objects within the digital image as well as other aspects of those objects depicted in the digital image (e.g., their shadows). Thus, the scene-based image editing system 106 provides better image results when removing or moving objects, as it accommodates these other aspects. This further results in reduced user interaction with the graphical user interface because the scene-based image editing system 106 does not require user interaction for moving or removing shadows of objects (e.g., user interaction for identifying shadow pixels and/or binding shadow pixels to objects).
In some implementations, the scene-based image editing system 106 implements one or more additional features to facilitate modification of the digital image. In some embodiments, these features provide additional user interface-based efficiencies in that they reduce the amount of user interaction with the user interface that is typically required to perform certain actions in the context of image editing. In some cases, these features also facilitate deployment of scene-based image editing systems 106 on computing devices with limited screen space, as they effectively use available space to facilitate image modification without crowding the display with unnecessary visual elements.
As described above, in one or more embodiments, the scene-based image editing system 106 provides editing of a two-dimensional ("2D") image based on three-dimensional ("3D") features of the scene depicted in the 2D image. In particular, the scene-based image editing system 106 processes the two-dimensional image with a plurality of models to determine a three-dimensional understanding of the two-dimensional scene in the two-dimensional image. The scene-based image editing system 106 also provides tools for editing two-dimensional images, such as by moving objects within or inserting objects into the two-dimensional images.
FIG. 40 illustrates an overview of a scene-based image editing system 106 modifying a two-dimensional image by placing two-dimensional objects according to three-dimensional characteristics of a two-dimensional image scene. In particular, the scene-based image editing system 106 provides a tool for editing two-dimensional images 4000. For example, the scene-based image editing system 106 provides a tool for editing objects within the two-dimensional image 4000 or inserting objects into the two-dimensional image 4000 and generating shadows based on the edited image.
In one or more embodiments, the scene-based image editing system 106 detects a request to place a two-dimensional object 4002 at a selected location within the two-dimensional image 4000. For example, the scene-based image editing system 106 determines that the two-dimensional object 4002 includes an object detected within the two-dimensional image 4000. In an additional embodiment, the scene-based image editing system 106 determines that the two-dimensional object 4002 comprises an object that is inserted into the two-dimensional image 4000.
According to one or more embodiments, the scene in the two-dimensional image 4000 includes one or more foreground and/or one or more background objects. As an example, the two-dimensional image 4000 of fig. 40 includes a foreground object, such as an automobile, in a background including an overhead landscape view. In some embodiments, the two-dimensional image includes one or more objects determined by the scene-based image editing system 106 to be part of the background, without any foreground objects.
In one or more embodiments, in connection with editing two-dimensional image 4000, scene-based image editing system 106 processes two-dimensional image 4000 to obtain a three-dimensional understanding of a scene in two-dimensional image 4000. For example, the scene-based image editing system 106 determines three-dimensional characteristics 4004 of a scene in the two-dimensional image 4000. In some embodiments, the scene-based image editing system 106 determines the three-dimensional characteristics 4004 by determining the relative position of objects in the two-dimensional image 4000 in three-dimensional space. To illustrate, the scene-based image editing system 106 determines three-dimensional characteristics 4004 by estimating depth values of pixels in the two-dimensional image 4000. In additional embodiments, the scene-based image editing system 106 generates a three-dimensional grid (or multiple three-dimensional grids) representing the scene in the two-dimensional image 4000.
According to one or more embodiments, as shown in FIG. 40, the scene-based image editing system 106 utilizes three-dimensional characteristics 4004 to generate a modified two-dimensional image 4006 comprising a two-dimensional object 4002 at a selected location. In particular, the scene-based image editing system 106 utilizes three-dimensional characteristics 4004 to determine a location to place the two-dimensional object 4002 relative to one or more other objects and to generate realistic shadows from a three-dimensional understanding of the scene. For example, the scene-based image editing system 106 utilizes three-dimensional characteristics 4004 to determine object and scene grids, and to determine the position of objects in three-dimensional space relative to one or more other objects in the scene.
Further, the scene-based image editing system 106 generates a modified two-dimensional image 4006 to include one or more shadows based on the position of the two-dimensional object 4002 relative to other object(s) in the scene. In particular, the scene-based image editing system 106 utilizes the three-dimensional characteristics 4004 of objects and scenes and image illumination parameters to generate a shadow map. Further, the scene-based image editing system 106 renders the modified two-dimensional image 4006 by merging shadow maps according to the relative positions of the two-dimensional object 4002 and one or more other objects in the scene (e.g., background objects) to determine the correct position, direction, and shape of one or more shadows in the modified two-dimensional image 4006. Thus, the scene-based image editing system 106 provides accurate movement, insertion, and shading of objects for editing of two-dimensional images according to the automatically determined three-dimensional characteristics of the two-dimensional scene.
In one or more embodiments, scene-based image editing system 106 provides an improvement over conventional systems that provide shadow generation and editing in digital images. For example, in contrast to conventional systems that use image-based shadow generation in two-dimensional images, the scene-based image editing system 106 utilizes three-dimensional understanding of the content in the two-dimensional images to generate shadows. In particular, the scene-based image editing system 106 may reconstruct a scene of a two-dimensional image in three-dimensional space for determining whether and how modifications to content in the two-dimensional image affect shadows within the scene. To illustrate, the scene-based image editing system 106 provides reliable shadow interactions while moving existing and virtual objects within a two-dimensional image according to a three-dimensional representation of a scene.
By generating a three-dimensional representation of a two-dimensional scene in a two-dimensional image, the scene-based image editing system 106 generates shadows in real-time from a three-dimensional understanding of the content for presentation within a graphical user interface. In particular, unlike conventional systems that utilize deep learning to generate shadows based on modified content in a two-dimensional image, the scene-based image editing system 106 provides updated shadows when a user interacts with a scene. Thus, the scene-based image editing system 106 may provide the user with an accurate rendering of modifications to shadows within the two-dimensional image based on the three-dimensional context of foreground and background objects (e.g., based on the estimated depth and position of the objects). Furthermore, by generating proxy objects to represent existing objects in a scene of a two-dimensional image, a scene-based image editing system may provide efficient and accurate shadow generation based on changes made to corresponding objects in the scene.
As described above, the scene-based image editing system 106 utilizes multiple models to modify a two-dimensional image based on movement or insertion of objects within the scene of the two-dimensional image. In particular, the scene-based image editing system 106 utilizes multiple models to extract features of the two-dimensional image and determine the illumination and structure of the two-dimensional image. FIG. 41 illustrates an embodiment in which a scene-based image editing system 106 edits a two-dimensional image using multiple models to place objects within a scene of the two-dimensional image according to user interactions. More specifically, the scene-based image editing system 106 utilizes multiple models to generate updated two-dimensional shadows from illumination and three-dimensional characteristics extracted from the two-dimensional image.
As shown in fig. 41, the scene-based image editing system 106 may utilize various machine learning models or neural networks to generate modified digital images that depict manipulated shadows modeled from a three-dimensional scene extracted from a two-dimensional image. For example, fig. 41 shows a digital image 4102. As shown, the scene based image editing system 106 utilizes the depth estimation/refinement model 4104 to generate and/or refine a depth map for the digital image 4102. To illustrate, the scene-based image editing system 106 utilizes the depth estimation/refinement model 4104 to determine the per-pixel depth values depicted in the digital image 4102.
In one or more embodiments, the depth estimation/refinement model 4104 includes a depth estimation neural network to generate a depth map that includes per-pixel depth values relative to the viewpoint of the digital image 4102. Specifically, the per-pixel depth value determines a relative depth/distance to a viewpoint of the digital image 4102 (e.g., camera viewpoint) based on a relative position of the object in the digital image 4102. To illustrate, the depth estimation model includes a monocular depth estimation model (e.g., a single-image depth estimation model) having a convolutional neural network structure. Alternatively, the depth estimation model generates the depth map using a transformer model and/or a self-attention layer. For example, in one or more embodiments, scene-based image editing system 106 utilizes a depth estimation model as described below: "Generating Depth Images Utilizing A Machine-Learning Model Built From Mixed Digital Image Sources And Multiple Loss Function Sets" (generating depth images using a machine learning model built up of a hybrid digital image source and multiple loss function sets), U.S. patent application Ser. No. 17/186,436 filed on February 26, 2021, which is incorporated herein by reference in its entirety. In one or more embodiments, the depth estimation/refinement model 4104 generates a depth map by determining, for each pixel in the digital image 4102, a relative distance/depth of objects detected in the digital image 4102.
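As a hedged illustration of monocular depth estimation, the following sketch uses the publicly available MiDaS model from torch.hub as an off-the-shelf stand-in for the depth estimation neural network described above; the model choice and input file name are assumptions, not the patented implementation.

import cv2
import torch

# Load an off-the-shelf monocular depth estimator (a stand-in, not the patented model).
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

img = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)  # assumed input file
batch = transforms.small_transform(img)

with torch.no_grad():
    prediction = midas(batch)
    # Resize the prediction back to the original image resolution.
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2], mode="bicubic", align_corners=False
    ).squeeze().cpu().numpy()  # per-pixel relative depth values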
Further, in one or more embodiments, the depth estimation/refinement model 4104 includes a refinement model that refines the depth map of the digital image 4102. In particular, the refinement model utilizes an image segmentation model to generate (or otherwise obtain) a segmentation mask for the digital image 4102. To illustrate, the image segmentation model includes a convolutional neural network trained to segment a digital object from a digital image. In one or more embodiments, the scene-based image editing system utilizes an image segmentation model as described below: "Deep Salient Content Neural Networks for Efficient Digital Object Segmentation" (a deep highlight neural network for efficient digital object segmentation), U.S. patent application publication No. 2019/013129 filed on October 31, 2017, which is incorporated herein by reference in its entirety. Furthermore, in one or more embodiments, the refinement model includes the model described below. Indeed, in one or more implementations, the scene-based image editing system 106 utilizes the methods described below: "UTILIZING MACHINE LEARNING MODELS TO GENERATE REFINED DEPTH MAPS WITH SEGMENTATION MASK GUIDANCE" (a machine learning model is used TO generate a refined depth map with segmentation masking guidance), U.S. patent application Ser. No. 17/658,873 filed on April 12, 2022, which is incorporated herein by reference in its entirety.
Fig. 41 also shows that scene-based image editing system 106 utilizes three-dimensional scene representation model 4106 to generate three-dimensional scene representation 4108 of a scene depicted in digital image 4102. In one or more embodiments, the three-dimensional scene representation model 4106 includes one or more neural networks for generating one or more three-dimensional meshes representing objects in the digital image 4102. For example, the three-dimensional scene representation model 4106 utilizes the depth map to generate one or more three-dimensional meshes representing the scene in the digital image 4102. To illustrate, the three-dimensional scene representation model 4106 utilizes per-pixel depth values of the depth map to generate vertices and connecting edges with coordinates in three-dimensional space, representing the scene of the digital image 4102 according to a particular grid resolution (e.g., a selected density or number of vertices/faces).
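The following sketch illustrates one way such a mesh could be constructed from a depth map, assuming pinhole camera intrinsics (fx, fy, cx, cy) and a grid step that controls mesh resolution; the function name and parameters are illustrative, not the patent's implementation.

import numpy as np

def depth_to_mesh(depth, fx, fy, cx, cy, step=4):
    """Unproject a depth map (HxW) into vertices and triangle faces."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h:step, 0:w:step]
    z = depth[ys, xs]
    x = (xs - cx) / fx * z                       # pinhole unprojection of each sampled pixel
    y = (ys - cy) / fy * z
    vertices = np.stack([x, y, z], axis=-1).reshape(-1, 3)

    rows, cols = ys.shape
    idx = np.arange(rows * cols).reshape(rows, cols)
    # Two triangles per grid cell, linking each vertex to its right/bottom neighbors.
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate([np.stack([tl, bl, tr], 1), np.stack([tr, bl, br], 1)])
    return vertices, faces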
In some embodiments, scene-based image editing system 106 utilizes one or more neural networks to generate a three-dimensional grid representing one or more objects in digital image 4102. For example, the scene-based image editing system 106 generates a three-dimensional grid comprising a tessellation generated from the pixel depth values of the two-dimensional image and the estimated camera parameters. To illustrate, the scene-based image editing system 106 determines a mapping (e.g., projection) between the three-dimensional grid and pixels of the digital image 4102 for use in determining which portions of the three-dimensional grid to modify in connection with editing the digital image 4102. In an alternative embodiment, scene-based image editing system 106 utilizes depth displacement information and/or adaptive sampling based on density information from the digital image 4102 to generate an adaptive three-dimensional grid representing the content of digital image 4102.
In one or more embodiments, scene-based image editing system 106 detects objects in digital image 4102, specifically in connection with generating the three-dimensional scene representation 4108 via the three-dimensional scene representation model 4106. For example, the scene-based image editing system 106 utilizes the object detection model 4110 to detect foreground and/or background objects in the digital image 4102 (e.g., via segmentation masks generated by the image segmentation model described above). In one or more embodiments, the object detection model 4110 includes or is part of an image segmentation model. Thus, the scene based image editing system 106 utilizes the object detection model 4110 to provide information associated with various objects in the digital image 4102 to the three-dimensional scene representation model 4106. The three-dimensional scene representation model 4106 utilizes the object information and the depth map to generate one or more three-dimensional meshes representing one or more detected objects in the digital image 4102. Thus, the three-dimensional scene representation model 4106 generates a three-dimensional scene representation 4108 via one or more three-dimensional meshes representing the content of the digital image 4102 from the depth map generated by the depth estimation/refinement model 4104.
Further, in one or more embodiments, scene-based image editing system 106 estimates camera/lighting parameters of digital image 4102. For example, the scene-based image editing system 106 utilizes an illumination/camera model 4112 that extracts illumination features 4114 and camera parameters 4116 from the digital image 4102. To illustrate, the scene-based image editing system 106 utilizes the illumination/camera model 4112 to determine one or more values or parameters of one or more light sources in the digital image 4102. Further, the scene-based image editing system 106 utilizes the illumination/camera model 4112 to determine one or more camera parameters based on the view represented in the digital image 4102. In one or more embodiments, the scene-based image editing system 106 utilizes a camera parameter estimation neural network as described in: U.S. Patent No. 11,094,083, entitled "UTILIZING A CRITICAL EDGE DETECTION NEURAL NETWORK AND A GEOMETRIC MODEL TO DETERMINE CAMERA PARAMETERS FROM A SINGLE DIGITAL IMAGE" (determination of camera parameters FROM a single digital image using a critical edge detection neural network and geometric model), filed on January 25, 2019, incorporated herein by reference in its entirety. In additional embodiments, scene-based image editing system 106 extracts one or more camera parameters from metadata associated with digital image 4102.
In accordance with one or more embodiments, scene-based image editing system 106 detects interactions with a client device to modify digital image 4102. In particular, scene-based image editing system 106 detects user interactions with selected locations 4118 associated with objects within digital image 4102. For example, scene-based image editing system 106 detects user interactions that insert objects or move objects within digital image 4102. To illustrate, user interaction at the selected location 4118 moves the object from one location to another location in the digital image 4102. Alternatively, user interaction at the selected location 4118 inserts an object at the selected location in the digital image 4102.
In one or more embodiments, the scene-based image editing system 106 utilizes the shadow generation model 4120 to generate a modified digital image 4122 based on the digital image 4102. In particular, the shadow generation model 4120 generates a modified digital image 4122 by generating updated shadows according to the selected locations of the objects. To illustrate, the shadow generation model 4120 generates a plurality of shadow maps from the three-dimensional scene representation 4108, the illumination features 4114, and the camera parameters 4116 corresponding to the digital image 4102. In addition, the scene-based image editing system 106 utilizes the shadow generation model 4120 to generate a modified digital image 4122 based on the shadow map.
In additional embodiments, the scene-based image editing system 106 utilizes the repair model 4124 to generate the modified digital image 4122. In particular, in response to moving an object from one location to another location, the scene-based image editing system 106 utilizes the repair model 4124 to generate one or more repair areas in the modified digital image 4122. For example, the scene-based image editing system 106 utilizes the repair model 4124 to populate a portion of the three-dimensional grid corresponding to the initial position of the object in three-dimensional space. The scene-based image editing system 106 also utilizes the repair model 4124 (or another repair model) to fill corresponding portions of the modified digital image 4122 in two-dimensional space. In one or more embodiments, the scene-based image editing system 106 utilizes a repair neural network as described below: U.S. patent application Ser. No. 17/663,317, entitled "OBJECT CLASS INPAINTING IN DIGITAL IMAGES UTILIZING CLASS-SPECIFIC INPAINTING NEURAL NETWORKS" (digital image OBJECT CLASS repair based on certain classes of repair NEURAL NETWORKS), or U.S. patent application Ser. No. 17/815,409, entitled "GENERATING NEURAL NETWORK BASED PERCEPTUAL ARTIFACT SEGMENTATIONS IN MODIFIED PORTIONS OF A DIGITAL IMAGE" (segmented NEURAL NETWORKS generated based on perceived artifacts in modified portions of digital images), filed on month 7, 2022, which are incorporated herein by reference in their entirety.
As described above, the scene-based image editing system 106 modifies the two-dimensional image using the plurality of shadow maps. FIG. 42 illustrates a scene-based image editing system 106 editing a two-dimensional image by placing one or more objects within a scene of the two-dimensional image. In particular, FIG. 42 illustrates the scene-based image editing system 106 generating multiple shadow maps corresponding to different object types of objects placed within a two-dimensional image.
In one or more embodiments, the shadow map includes a map for projection that is used in a process for determining whether and where to generate one or more shadows in three-dimensional space. For example, a shadow map includes data indicating whether a particular object casts shadows in one or more directions according to one or more light sources within a three-dimensional space. Accordingly, the scene-based image editing system 106 utilizes the shadow map to determine whether a particular pixel in the scene-based rendered two-dimensional image is visible from the light source based on the depth value associated with the pixel (e.g., the corresponding three-dimensional position of the surface in three-dimensional space), the position and orientation of the light source, and/or one or more intermediate objects. Thus, the scene-based image editing system 106 utilizes multiple shadow maps to generate shadows from the locations of object(s) and/or light source(s) in the scene.
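A minimal sketch of this visibility test is shown below, assuming a 4x4 light view-projection matrix, a shadow map storing normalized occluder depths seen from the light, and a small depth bias; the conventions (NDC ranges, row orientation) are illustrative assumptions rather than details from the disclosure.

import numpy as np

def in_shadow(world_point, light_view_proj, shadow_map, bias=1e-3):
    """Return True if the 3D point is occluded from the light source."""
    p = light_view_proj @ np.append(world_point, 1.0)   # world space -> light clip space
    p = p[:3] / p[3]                                    # perspective divide -> NDC in [-1, 1]
    u = int((p[0] * 0.5 + 0.5) * (shadow_map.shape[1] - 1))
    v = int((p[1] * 0.5 + 0.5) * (shadow_map.shape[0] - 1))
    if not (0 <= u < shadow_map.shape[1] and 0 <= v < shadow_map.shape[0]):
        return False                                    # outside the map: assume lit
    current_depth = p[2] * 0.5 + 0.5                    # normalized depth of this point from the light
    return current_depth - bias > shadow_map[v, u]      # farther than the nearest occluder -> shadowed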
In one or more embodiments, the scene-based image editing system 106 processes a two-dimensional image 4200 that includes one or more foreground and/or background objects in a scene. For example, the scene-based image editing system 106 utilizes one or more image processing models (e.g., object detection models) to extract the background 4202 from the two-dimensional image 4200. In addition, the scene-based image editing system 106 utilizes one or more image processing models to extract the foreground objects 4204 from the two-dimensional image 4200. To illustrate, the scene-based image editing system 106 determines that the foreground objects 4204 of the two-dimensional image 4200 include a large vehicle and the background 4202 includes a road lined with rows of trees.
In additional embodiments, the scene-based image editing system 106 also determines one or more additional objects for insertion into the two-dimensional image 4200 in connection with editing the two-dimensional image 4200. As an example, fig. 42 shows an insert object 4206 that includes graphics imported from a separate file or program. To illustrate, the insertion object 4206 includes a three-dimensional model with a previously generated three-dimensional mesh.
In accordance with one or more embodiments, scene-based image editing system 106 generates a plurality of shadow maps based on a plurality of shadow types corresponding to data extracted from two-dimensional image 4200. For example, the scene-based image editing system 106 determines separate foreground and background shadow maps. Further, the scene-based image editing system 106 generates a separate shadow map based on any objects inserted into the two-dimensional image 4200.
For example, fig. 42 shows that scene-based image editing system 106 generates estimated shadow map 4210 corresponding to background 4202. In particular, scene-based image editing system 106 generates estimated shadow map 4210 to represent shadows corresponding to one or more background objects and/or to off-camera objects corresponding to background 4202. To illustrate, the background 4202 includes shadows created by trees arranged on the road side that are invisible (or only partially visible) within the two-dimensional image 4200. Further, in one or more embodiments, scene-based image editing system 106 generates estimated shadow map 4210 for shadows created by ground formations or other objects (e.g., mountains) that are part of a background landscape.
In accordance with one or more embodiments, the scene-based image editing system 106 generates the estimated shadow map 4210 by determining pixels in the two-dimensional image 4200 that indicate shadowed portions. In particular, the scene-based image editing system 106 extracts luminance values (or other illumination-based values) indicating that pixel values of a portion of the two-dimensional image 4200 are altered by shadows. The scene-based image editing system 106 utilizes the extracted values to identify a particular region of the two-dimensional image 4200 that is covered by the background shadow. Further, the scene-based image editing system 106 utilizes camera parameters (e.g., camera positions) corresponding to the two-dimensional image 4200 to generate a shadow map for the shadow region of the two-dimensional image 4200 (e.g., based on projection rays from the camera position that incorporate background depth values or a background grid). In some embodiments, the scene-based image editing system 106 renders the background grid with the estimated shadow map 4210 as a texture, where the scene-based image editing system 106 writes the masked area to the estimated shadow map 4210 as occlusion pixels at infinity.
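As an illustrative sketch (not the disclosed method), shadowed background pixels could be flagged by comparing each pixel's luminance against a local average; the threshold ratio and blur kernel size below are assumed values.

import cv2
import numpy as np

def estimate_background_shadow_mask(image_rgb, ratio=0.6, blur_ksize=51):
    """Rough shadow mask: pixels whose luminance falls well below the local average."""
    lum = cv2.cvtColor(image_rgb, cv2.COLOR_RGB2GRAY).astype(np.float32)
    local_avg = cv2.GaussianBlur(lum, (blur_ksize, blur_ksize), 0)   # neighborhood brightness
    return lum < ratio * local_avg      # HxW boolean mask of shadowed background pixels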
In one or more embodiments, the scene-based image editing system 106 also generates a separate shadow map for the foreground object 4204. In particular, the foreground object 4204 detected in the two-dimensional image 4200 may include a partial object based on a visible portion of the foreground object 4204. For example, when determining the three-dimensional characteristics of the foreground object 4204, the scene-based image editing system 106 determines the three-dimensional characteristics of the visible portion. In some embodiments, scene-based image editing system 106 generates a partial three-dimensional grid representing the visible portion of foreground object 4204.
Thus, in one or more embodiments, the scene-based image editing system 106 generates the shadow agent 4208 to include a more complete representation of the foreground object 4204. For example, the scene-based image editing system 106 generates the shadow agent 4208 to determine one or more shadows cast by the foreground object 4204 by estimating a complete (or more complete) three-dimensional shape of the foreground object 4204. In some embodiments, as described in more detail below with reference to fig. 47A-47B, the scene-based image editing system 106 generates a shadow agent 4208 to replace one or more shadows within a scene of the two-dimensional image 4200 corresponding to the foreground object 4204.
In at least some embodiments, scene-based image editing system 106 utilizes shadow proxy 4208 to generate proxy shadow map 4212 for foreground object 4204. In particular, scene-based image editing system 106 generates proxy shadow map 4212 for generating one or more shadows associated with shadow proxy 4208. To illustrate, scene-based image editing system 106 generates proxy shadow map 4212 to include one or more shadows based on shadow proxy 4208, rather than foreground object 4204. In conjunction with generating the proxy shadow map 4212, in one or more embodiments, the scene-based image editing system 106 removes one or more shadows generated by the foreground objects 4204. Thus, the scene-based image editing system 106 uses the proxy shadow map 4212 as a replacement for the original shadows of the foreground objects 4204.
Fig. 42 also shows that scene-based image editing system 106 generates object shadow map 4214 representing insertion object 4206. In particular, the scene-based image editing system 106 generates the object shadow map 4214 to represent any objects inserted into the two-dimensional image 4200 for which the scene-based image editing system 106 has determined a previously generated (or already existing) three-dimensional grid. For example, the scene-based image editing system 106 determines that one or more objects inserted into the two-dimensional image 4200 have corresponding three-dimensional grids imported from one or more other files or applications.
Thus, as described above, scene-based image editing system 106 generates each separate shadow map associated with a plurality of different object types or shadow types. For example, the different shadow maps include different information for generating shadows based on the corresponding object types. To illustrate, scene-based image editing system 106 determines whether each object type is visible, illuminated, cast a shadow, or receive a shadow within two-dimensional image 4200. Scene-based image editing system 106 also determines whether each object type receives shadows from one or more other particular object types.
As an example, the scene-based image editing system 106 determines that the background object (e.g., background 4202) is visible, not illuminated, and does not cast a shadow (e.g., because it is part of the background). In addition, scene-based image editing system 106 determines that the background object receives shadows from the shadow proxy and the inserted object. In one or more embodiments, the scene-based image editing system 106 determines that the foreground object (e.g., foreground object 4204) is visible, not illuminated, and does not cast shadows (e.g., due to having incomplete three-dimensional property data). The scene-based image editing system 106 also determines that the foreground object receives shadows from the inserted object and the estimated shadow map 4210. In additional embodiments, scene-based image editing system 106 also determines that an inserted object (e.g., inserted object 4206) is visible, illuminated, casts a shadow, and receives shadows from all other shadow sources.
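The per-type rules described above can be summarized in a small lookup structure; the following sketch is an assumed encoding (the names and the inclusion of the hidden shadow-proxy type are illustrative, not part of the disclosure).

from dataclasses import dataclass

@dataclass
class ShadowBehavior:
    visible: bool                 # rendered in the final color output
    illuminated: bool             # shaded by the estimated light sources
    casts_shadow: bool            # contributes to a shadow map
    receives_shadow_from: tuple   # shadow sources sampled when shading this type

SHADOW_RULES = {
    "background": ShadowBehavior(True, False, False, ("proxy", "inserted")),
    "foreground": ShadowBehavior(True, False, False, ("inserted", "estimated")),
    "inserted":   ShadowBehavior(True, True,  True,  ("proxy", "inserted", "estimated")),
    # The proxy casts shadows on behalf of a foreground object but is never rendered.
    "proxy":      ShadowBehavior(False, False, True, ()),
}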
In one or more embodiments, although FIG. 42 shows individual objects for each object type, scene-based image editing system 106 alternatively determines that one or more object types have multiple objects. For example, the scene-based image editing system 106 determines that the two-dimensional image includes more than one foreground object and/or more than one inserted object. Thus, the scene-based image editing system 106 generates each shadow map associated with all objects of the corresponding type. To illustrate, the scene-based image editing system 106 generates proxy shadow maps associated with a plurality of foreground objects or object shadow maps associated with a plurality of intervening objects.
In accordance with one or more embodiments, the scene-based image editing system 106 utilizes the shadow map to generate a modified two-dimensional image in response to generating a plurality of shadow maps for a plurality of object types associated with editing the two-dimensional image. For example, the scene-based image editing system 106 utilizes camera and illumination information associated with the two-dimensional image to generate realistic shadows in the modified two-dimensional image from the shadow map. Fig. 43 shows a diagram of the scene-based image editing system 106 determining illumination information associated with a two-dimensional image 4300.
In particular, FIG. 43 illustrates the scene based image editing system 106 processing a two-dimensional image 4300 to determine image based illumination parameters 4302 and light sources 4304. For example, the scene-based image editing system 106 utilizes one or more image processing models to estimate the position, brightness, color, and/or hue of the light source 4304. Further, the scene-based image editing system 106 utilizes one or more image processing models to determine camera position/location, focal length, or other camera parameters relative to the light source 4304. In one or more embodiments, the scene-based image editing system 106 utilizes a machine learning model as described below: U.S. patent application Ser. No. 16/558,975, filed on 3/9/2019, entitled "DYNAMICALLY ESTIMATING LIGHT-SOURCE-SPECIFIC PARAMETERS FOR DIGITAL IMAGES USING A NEURAL NETWORK" (dynamic estimation of light SOURCE specific parameters of digital images using neural networks), which is incorporated herein by reference in its entirety.
In accordance with one or more embodiments, in response to determining the illumination parameters of the one or more light sources based on the two-dimensional image, the scene-based image editing system 106 inserts the one or more corresponding light sources into a three-dimensional space having a three-dimensional representation of the scene in the two-dimensional image. In particular, the scene-based image editing system 106 utilizes the estimated light source position, the estimated light source direction, the estimated light source intensity, and the estimated light source type (e.g., point source, line source, area source, image-based light source, or global source) to insert the light source into the three-dimensional space. To illustrate, the scene-based image editing system 106 inserts light sources at specific locations and has specific parameters estimated from the two-dimensional image 4300 to provide light to a plurality of three-dimensional objects (e.g., foreground objects and background objects) for rendering according to the image-based illumination parameters 4302 and the estimated camera parameters.
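For illustration, the estimated parameters could be collected into a simple container and placed into the reconstructed scene; the field names and light kinds below mirror the types listed above but are otherwise assumptions.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class EstimatedLight:
    position: Tuple[float, float, float]   # location in the reconstructed scene's coordinate frame
    direction: Tuple[float, float, float]  # unit vector (relevant for directional/area lights)
    intensity: float
    color: Tuple[float, float, float]      # (r, g, b) in [0, 1]
    kind: str                              # "point", "line", "area", "image_based", or "global"

def add_estimated_light(scene_lights: List[EstimatedLight], light: EstimatedLight) -> List[EstimatedLight]:
    """Insert the estimated light into the three-dimensional scene so that foreground
    and background meshes render under the estimated illumination."""
    scene_lights.append(light)
    return scene_lights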
In conjunction with determining illumination parameters for one or more light sources in the digital image and/or one or more camera parameters of the digital image, the scene-based image editing system 106 renders the modified digital image in response to user input. Fig. 44 illustrates that the scene-based image editing system 106 generates a modified two-dimensional image 4400 based on one or more objects inserted or modified within a two-dimensional image (e.g., two-dimensional image 4200 of fig. 42). Further, fig. 44 illustrates that the scene-based image editing system 106 utilizes the three-dimensional characteristics of illumination data, shadow data, and content in a two-dimensional image to generate a modified two-dimensional image 4400.
In accordance with one or more embodiments, scene-based image editing system 106 utilizes rendering model 4402 to process data from a two-dimensional image and one or more objects associated with the two-dimensional image. For example, the scene-based image editing system 106 utilizes a rendering model 4402 including a three-dimensional rendering model to generate a modified two-dimensional image 4400 based on a three-dimensional grid 4404 representing one or more objects in the two-dimensional image. Furthermore, scene-based image editing system 106 utilizes rendering model 4402 to generate shadows of one or more objects from shadow map 4406 and illumination parameters 4408. Thus, the scene-based image editing system 106 reconstructs the scene in the two-dimensional image using the three-dimensional grid 4404, the shadow map 4406, and the illumination parameters 4408.
In one or more embodiments, the scene-based image editing system 106 utilizes shadow maps to generate one or more shadows for one or more foreground objects and/or one or more inserted objects. In particular, the scene-based image editing system 106 utilizes illumination parameters (e.g., light source locations) to incorporate shadow maps for synthesizing the modified two-dimensional image 4400. For example, the scene-based image editing system 106 utilizes the rendering model 4402 to merge the shadow map 4406 based on the object type and relative position of the three-dimensional grid 4404 according to the illumination parameters 4408. To illustrate, as described above, the scene-based image editing system 106 incorporates the shadow map 4406 to create a foreground shadow map and a background shadow map. In some embodiments, scene-based image editing system 106 also incorporates shadow map 4406 to generate an inset shadow map of one or more inset objects.
As an example, the scene-based image editing system 106 generates a foreground shadow map by combining an estimated shadow map corresponding to the background of the two-dimensional image and an object shadow map corresponding to the inserted object. Specifically, the scene-based image editing system 106 generates the foreground shadow map as: FOREGROUND(x) = min(OBJECT(x), SHADOWMASK(x)), where OBJECT represents the object shadow map, SHADOWMASK represents the estimated shadow map, and x represents the distance of a particular pixel from the light source. In addition, the scene-based image editing system 106 generates a background shadow map as: BACKGROUND(x) = max(1 - FOREGROUND(x), min(OBJECT(x), PROXY(x))), where PROXY represents the proxy shadow map. Furthermore, the scene-based image editing system 106 generates an insert shadow map as: INSERT(x) = min(FOREGROUND(x), PROXY(x)). By merging the shadow maps as described above, the scene-based image editing system 106 detects whether a first object is between the light source and a second object for determining how to shadow the first object and/or the second object.
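A direct transcription of these merge formulas is shown below, assuming each shadow map is stored as a NumPy array sampled at the same resolution; the value convention for x follows the text above.

import numpy as np

def merge_shadow_maps(object_map, estimated_map, proxy_map):
    # FOREGROUND(x) = min(OBJECT(x), SHADOWMASK(x))
    foreground = np.minimum(object_map, estimated_map)
    # BACKGROUND(x) = max(1 - FOREGROUND(x), min(OBJECT(x), PROXY(x)))
    background = np.maximum(1.0 - foreground, np.minimum(object_map, proxy_map))
    # INSERT(x) = min(FOREGROUND(x), PROXY(x))
    inserted = np.minimum(foreground, proxy_map)
    return foreground, background, inserted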
In one or more embodiments, the scene-based image editing system 106 applies the merged shadow maps when shading individual objects in the scene according to the object type/category of each object. Specifically, as described above, the scene-based image editing system 106 calculates light intensity by sampling the object shadow map to shade the inserted object using a physically based rendering shader. In one or more embodiments, the scene-based image editing system 106 does not render proxy objects in the final color output (e.g., conceals proxy objects in the modified two-dimensional image 4400). Furthermore, the scene-based image editing system 106 generates background and foreground object colors as: COLOR(x) = SHADOW_FACTOR(x) × TEXTURE(x) + (1 - SHADOW_FACTOR(x)) × SHADOW_COLOR. SHADOW_FACTOR is a value generated by the scene-based image editing system 106 by sampling the appropriate shadow map; the larger the sampling radius, the softer the resulting shadow. In addition, SHADOW_COLOR represents ambient light determined by the scene-based image editing system 106 via a shadow estimation model or based on user input. TEXTURE represents the texture applied to the three-dimensional grid 4404 according to the corresponding pixel values in the two-dimensional image.
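The shading formula can be expressed directly as an array operation, as in the following sketch; the array layouts are assumptions for illustration.

import numpy as np

def shade(texture, shadow_factor, shadow_color):
    """COLOR(x) = SHADOW_FACTOR(x) * TEXTURE(x) + (1 - SHADOW_FACTOR(x)) * SHADOW_COLOR.

    texture: HxWx3 colors sampled from the two-dimensional image.
    shadow_factor: HxW values in [0, 1] sampled from the merged shadow map
                   (a larger sampling radius yields softer shadows).
    shadow_color: length-3 ambient color from the shadow estimation model or user input.
    """
    sf = np.asarray(shadow_factor)[..., None]
    return sf * np.asarray(texture) + (1.0 - sf) * np.asarray(shadow_color)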
While the above-described embodiments include the scene-based image editing system 106 merging shadow maps for the background, shadow proxies, and inserted objects for the digital image, in some embodiments the scene-based image editing system 106 generates and merges a subset of the combinations of the shadow maps described above. For example, in some embodiments, the scene-based image editing system 106 generates an estimated shadow map of the background based on the two-dimensional image and a proxy shadow map of the proxy three-dimensional mesh based on the foreground object, without generating a shadow map for the intervening objects (e.g., in response to determining that the two-dimensional image does not have any intervening objects). The scene-based image editing system 106 combines the estimated shadow map and the proxy shadow map to generate a modified two-dimensional image. Alternatively, the scene-based image editing system 106 combines the estimated shadow map and shadow map of one or more intervening objects without using a proxy shadow map (e.g., in response to determining that the two-dimensional image does not have any other foreground objects). Thus, the scene-based image editing system 106 may utilize the above formulas to incorporate and apply different shadow maps for different object types according to a particular object in a two-dimensional image.
Furthermore, the scene-based image editing system 106 utilizes the illumination parameters 4408 to determine the illumination and shading of objects in the scene of the modified two-dimensional image 4400. For example, when rendering the modified two-dimensional image 4400, the scene-based image editing system 106 determines ambient lighting, ambient occlusion, reflection, and/or other effects based on the lighting parameters 4408 and the three-dimensional grid 4404. In one or more embodiments, the scene-based image editing system 106 determines ambient lighting by blurring a copy of a two-dimensional image, wrapping the blurred copy of the two-dimensional image around the three-dimensional grid 4404 (e.g., generating a 360 degree environment around the scene), and rendering the resulting scene. In an alternative embodiment, scene-based image editing system 106 utilizes one or more neural networks to determine illumination from one or more hidden/out-of-camera portions of the scene for use during rendering of modified two-dimensional image 4400.
As described above, in one or more embodiments, the scene-based image editing system 106 provides object segmentation in a two-dimensional image. For example, the scene-based image editing system 106 utilizes object segmentation to identify foreground/background objects and/or to generate a three-dimensional grid corresponding to the foreground/background objects. FIG. 45 illustrates an overview of the scene-based image editing system 106 generating a semantic map that indicates individual objects within a two-dimensional image and generating one or more three-dimensional object meshes in a corresponding three-dimensional representation of the scene of the two-dimensional image.
As shown in fig. 45, the scene-based image editing system 106 determines a two-dimensional image 4500. For example, the two-dimensional image 4500 includes a plurality of objects, such as one or more objects in a foreground region and/or one or more objects in a background region. In one or more embodiments, the scene-based image editing system 106 utilizes a semantic segmentation neural network (e.g., object detection model, deep learning model) to automatically label pixels of the two-dimensional image 4500 with object classifications based on objects detected in the two-dimensional image 4500. In various embodiments, the scene-based image editing system 106 utilizes various models or architectures to determine object classification and image segmentation, as previously described. In addition, scene-based image editing system 106 generates a semantic map 4502 comprising object classifications of pixels of the two-dimensional image 4500.
In one or more embodiments, the scene-based image editing system 106 utilizes the semantic map 4502 to generate a segmented three-dimensional grid 4504. Specifically, the scene-based image editing system 106 utilizes the object classifications of pixels in the two-dimensional image 4500 to determine portions of the three-dimensional grid corresponding to objects in the two-dimensional image 4500. For example, the scene-based image editing system 106 utilizes a mapping between the two-dimensional image 4500 and a three-dimensional grid representing the two-dimensional image 4500 to determine object classifications for portions of the three-dimensional grid. To illustrate, the scene-based image editing system 106 determines a particular vertex of the three-dimensional mesh corresponding to a particular object (e.g., foreground object) detected in the two-dimensional image 4500 based on the mapping between the two-dimensional image 4500 and the three-dimensional mesh.
In one or more embodiments, in response to determining that different portions of the three-dimensional grid associated with the two-dimensional image correspond to different objects, the scene-based image editing system 106 segments the three-dimensional grid. In particular, the scene-based image editing system 106 utilizes object classification information associated with portions of the three-dimensional mesh to separate the three-dimensional mesh into a plurality of separate three-dimensional object meshes. For example, the scene-based image editing system 106 determines that a portion of the three-dimensional grid corresponds to a car in the two-dimensional image 4500 and separates the portion of the three-dimensional grid corresponding to the car from the rest of the three-dimensional grid.
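One way to perform this per-object separation, assuming each mesh vertex records the pixel it projects to, is sketched below; the array layouts and names are illustrative rather than the disclosed implementation.

import numpy as np

def split_mesh_by_semantics(vertices, faces, vertex_pixels, semantic_map):
    """Split a scene mesh into per-object meshes using the 2D-to-3D pixel mapping.

    vertex_pixels: (N, 2) integer (row, col) pixel coordinates each vertex projects to.
    semantic_map: HxW array of object-class labels for the two-dimensional image.
    Returns a dict mapping each class label to (vertices, reindexed faces).
    """
    vertex_labels = semantic_map[vertex_pixels[:, 0], vertex_pixels[:, 1]]
    meshes = {}
    for label in np.unique(vertex_labels):
        keep = vertex_labels == label
        # Keep faces whose three vertices all belong to this object.
        face_mask = keep[faces].all(axis=1)
        old_to_new = np.full(len(vertices), -1, dtype=int)
        old_to_new[keep] = np.arange(keep.sum())
        meshes[label] = (vertices[keep], old_to_new[faces[face_mask]])
    return meshes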
Thus, in one or more embodiments, the scene-based image editing system 106 segments a three-dimensional grid into two or more separate grids corresponding to a two-dimensional image. To illustrate, the scene-based image editing system 106 generates a segmented three-dimensional mesh 4504 by separating the three-dimensional grid representing the two-dimensional image 4500 into a plurality of separate three-dimensional object meshes corresponding to objects in the scene. For example, the scene-based image editing system 106 generates a three-dimensional object grid corresponding to an automobile, a three-dimensional object grid corresponding to a road, one or more three-dimensional object grids corresponding to one or more tree groups, and so forth.
In additional embodiments, the scene-based image editing system 106 segments the three-dimensional grid based on a subset of objects in the two-dimensional image. To illustrate, the scene-based image editing system 106 determines one or more objects in the two-dimensional image 4500 for segmenting a three-dimensional grid. For example, the scene-based image editing system 106 determines one or more objects in the foreground of the two-dimensional image 4500 to generate a separate three-dimensional object grid. In some embodiments, scene-based image editing system 106 determines a saliency (e.g., scale size) of objects used to generate the separate three-dimensional object grid. In one or more embodiments, the scene-based image editing system 106 determines one or more objects in response to selection of the one or more objects (e.g., manually selecting an automobile in the two-dimensional image 4500 via a graphical user interface displaying the two-dimensional image 4500). Alternatively, the scene-based image editing system 106 determines whether the object belongs to the foreground or the background and generates a separate grid for only the foreground object.
In accordance with one or more embodiments, the scene-based image editing system 106 provides a tool for editing two-dimensional digital images based on estimated three-dimensional characteristics of the digital images. FIGS. 46A-46C illustrate a graphical user interface of a client device for editing a two-dimensional image based on a generated three-dimensional representation of the two-dimensional image. For example, the scene-based image editing system 106 provides tools for moving objects within and/or inserting (e.g., importing) objects into a two-dimensional image.
In one or more embodiments, as shown in FIG. 46A, the client device displays a two-dimensional image 4600a for editing within the client application. For example, client applications include a digital image editing application that includes a plurality of tools for generating digital images or performing various modifications to digital images. Thus, the client device displays the two-dimensional image 4600a to be modified through interaction with one or more objects in the two-dimensional image 4600a.
In connection with editing two-dimensional image 4600a, scene-based image editing system 106 determines three-dimensional characteristics of a scene in two-dimensional image 4600a. For example, the scene-based image editing system 106 utilizes one or more neural networks to generate one or more three-dimensional grids corresponding to objects in the scene of the two-dimensional image 4600a. To illustrate, the scene-based image editing system 106 generates a first three-dimensional grid representing a plurality of background objects in a background of a scene. Further, the scene-based image editing system 106 generates a second three-dimensional grid representing the foreground object 4602a at a first location within the scene of the two-dimensional image 4600a (e.g., by segmenting the second three-dimensional grid from the first three-dimensional grid). In some embodiments, scene-based image editing system 106 generates one or more additional three-dimensional grids corresponding to one or more additional detected foreground objects (e.g., one or more portions of a fence shown in two-dimensional image 4600 a).
In accordance with one or more embodiments, in response to input selecting an object in the two-dimensional image 4600a via a graphical user interface of the client device, the scene-based image editing system 106 provides one or more options for placing the selected object within the two-dimensional image 4600a. In particular, the client device provides a tool within the client application by which a user can select to move an object within the two-dimensional image 4600a. For example, in response to selection of the foreground object 4602a, the scene-based image editing system 106 selects a three-dimensional grid corresponding to the foreground object 4602 a. To illustrate, in response to the selection, the scene-based image editing system 106 selects a plurality of vertices corresponding to the three-dimensional mesh of the foreground object 4602 a.
In one or more embodiments, the scene-based image editing system 106 modifies the two-dimensional image 4600a by changing the position of the selected object based on user input via the client device. For example, in response to a request to move the foreground object 4602a from a first location (e.g., an original location of the foreground object 4602 a) to a second location, the scene-based image editing system 106 modifies the location of the corresponding three-dimensional grid in three-dimensional space from the first location to the second location. Further, scene-based image editing system 106 updates two-dimensional image 4600a displayed within the graphical user interface of the client device in response to moving the selected object.
Fig. 46B illustrates the client device displaying a modified two-dimensional image 4600b based on moving the foreground object 4602b from the first position illustrated in fig. 46A to the second position. In particular, by generating a plurality of three-dimensional meshes corresponding to foreground object(s) and background object(s) in the two-dimensional image, the scene-based image editing system 106 provides realistic shadows related to placing the moving foreground object 4602b in the second location. For example, as shown, the scene-based image editing system 106 generates an updated shadow 4604 for the moving foreground object 4602b at its new location within the modified two-dimensional image 4600b.
In one or more embodiments, the scene-based image editing system 106 removes the previous shadow corresponding to the foreground object 4602a at the first location (as shown in fig. 46A). For example, the scene-based image editing system 106 utilizes the previously described shadow removal operations to remove the original shadows produced by the foreground object 4602a relative to the light source of the two-dimensional image 4600 a. In at least some embodiments, the scene-based image editing system 106 removes initial shadows before moving the foreground object 4602a within the two-dimensional image 4600 a. To illustrate, the scene-based image editing system 106 removes initial shadows in response to selection of a tool or function to move an object within the two-dimensional image 4600a or in response to opening the two-dimensional image 4600a within a client application. In additional embodiments, the scene-based image editing system 106 generates a new shadow for the foreground object 4602a at the first location and updates the new shadow in response to moving the foreground object 4602a (e.g., generates an updated shadow 4604). In other embodiments, in response to placing the foreground object 4602b in the second position, the scene-based image editing system 106 removes the initial shadows and generates updated shadows 4604.
As described above, the scene-based image editing system 106 generates shadows based on multiple shadow maps according to object types in a two-dimensional image. For example, the scene-based image editing system 106 generates updated shadows 4604 of the moving foreground object 4602b based at least in part on the estimated shadow map of the background and the proxy shadow map of the moving foreground object 4602 b. To illustrate, the scene-based image editing system 106 generates a shadow proxy (e.g., a proxy three-dimensional grid) for the moving foreground object 4602b and generates updated shadows 4604 using the shadow proxy instead of the moving foreground object 4602 b. Accordingly, the scene-based image editing system 106 merges the corresponding shadow maps according to the estimated illumination parameters of the modified two-dimensional image 4600b to generate updated shadows 4604.
Further, in one or more embodiments, scene-based image editing system 106 generates one or more repair areas in response to moving an object within a two-dimensional image. In particular, as shown in fig. 46B, the scene-based image editing system 106 detects one or more regions of the background portion previously covered by the foreground object. Specifically, as shown, in response to moving the car, the scene-based image editing system 106 determines that a portion of the road, fence, plants, etc. in the background behind the car is exposed.
The scene-based image editing system 106 utilizes one or more repair models (e.g., one or more neural networks) to repair one or more regions of the background portion. To illustrate, the scene-based image editing system 106 generates a repair region 4606 to recover lost information based on the moving foreground object 4602b previously covering one or more regions in the background portion. For example, the scene-based image editing system 106 utilizes the first model to reconstruct the grid at a location in three-dimensional space corresponding to the two-dimensional coordinates of the region behind the moving foreground object 4602 b. More specifically, the scene-based image editing system 106 utilizes the smoothing model to generate a smoothed depth value based on estimated three-dimensional points in three-dimensional space that correspond to regions in the two-dimensional image 4600a that are adjacent to or surrounding regions in the background portion. Thus, the scene-based image editing system 106 generates a plurality of vertices and edges to fill "holes" in the three-dimensional grid corresponding to the background behind the moving foreground object 4602 b.
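A simple stand-in for such a smoothing model is iterative diffusion of the known neighboring depths into the exposed region; the iteration count, initialization, and boundary handling below are illustrative assumptions, and the filled depths would then be re-triangulated into vertices and edges for the exposed background region.

import numpy as np

def fill_depth_hole(depth, hole_mask, iterations=200):
    """Fill the depth 'hole' left behind a moved object by diffusing neighbor depths."""
    filled = depth.copy()
    filled[hole_mask] = depth[~hole_mask].mean()         # rough initialization of unknown depths
    for _ in range(iterations):
        up = np.roll(filled, -1, axis=0)
        down = np.roll(filled, 1, axis=0)
        left = np.roll(filled, -1, axis=1)
        right = np.roll(filled, 1, axis=1)
        avg = (up + down + left + right) / 4.0
        filled[hole_mask] = avg[hole_mask]               # only update the unknown (hole) pixels
    return filled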
In one or more embodiments, scene-based image editing system 106 utilizes the filled portion of the three-dimensional mesh in the background to generate repair area 4606. In particular, scene-based image editing system 106 utilizes the repair model to generate predicted pixel values for pixels within repair area 4606. For example, the scene-based image editing system 106 uses the repair model to capture features of the region near the repair area 4606, such as based on pixel values, detected objects, or other visual or semantic attributes of the region near the repair area 4606. Accordingly, the scene-based image editing system 106 utilizes the context information from the neighboring areas to generate repair content for the repair area 4606. Furthermore, scene-based image editing system 106 applies repair area 4606 as a texture to the reconstructed mesh portion.
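As a rough sketch of the two-stage repair described above, the snippet below smooths depth values into the exposed "hole" by diffusion and then synthesizes texture with OpenCV's classical inpainting; both steps are simplified stand-ins for the smoothing model and the neural repair model, and the function names and parameters are illustrative assumptions.

```python
import numpy as np
import cv2  # classical routines used only as stand-ins for the learned models

def fill_depth_hole(depth, hole_mask, iterations=100):
    """Diffuse known depth values inward so the reconstructed mesh region behind the
    moved object receives smooth, consistent depths (stand-in for the smoothing model)."""
    filled = depth.astype(np.float32).copy()
    filled[hole_mask] = 0.0
    weight = (~hole_mask).astype(np.float32)
    kernel = np.ones((3, 3), np.float32) / 9.0
    for _ in range(iterations):
        blurred = cv2.filter2D(filled * weight, -1, kernel)
        norm = cv2.filter2D(weight, -1, kernel)
        update = np.divide(blurred, norm, out=np.zeros_like(blurred), where=norm > 0)
        filled[hole_mask] = update[hole_mask]
        weight = np.maximum(weight, (norm > 0).astype(np.float32))
    return filled

def inpaint_texture(image_bgr_uint8, hole_mask):
    """Classical inpainting as a placeholder for the repair model that predicts pixel
    values for the repair area from the surrounding context."""
    mask = hole_mask.astype(np.uint8) * 255
    return cv2.inpaint(image_bgr_uint8, mask, 3, cv2.INPAINT_TELEA)
```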
As described above, the scene-based image editing system 106 also provides a tool for inserting objects into a two-dimensional image. Fig. 46C shows the client device displaying a modified two-dimensional image 4600C including the foreground object 4602a of fig. 46A in an initial position. In addition, fig. 46C shows that the client device displays an insertion object 4608 imported into the modified two-dimensional image 4600C.
In one or more embodiments, in response to a request to insert an object into a two-dimensional image, scene-based image editing system 106 accesses the specified object and inserts the object into the two-dimensional image for display via a graphical user interface of a client device. For example, the selected object is associated with a previously defined three-dimensional grid. To illustrate, in response to a request to import an inserted object 4608 into a three-dimensional representation of a modified two-dimensional image 4600c, the scene-based image editing system 106 imports a three-dimensional grid corresponding to the inserted object 4608 from a separate file, database, or application.
Alternatively, the scene-based image editing system 106 imports a two-dimensional object corresponding to the inserted object 4608 into the modified two-dimensional image 4600c. For example, in connection with the request to insert the object into the two-dimensional image, the scene-based image editing system 106 generates a three-dimensional grid representing the two-dimensional object. To illustrate, the scene-based image editing system 106 utilizes one or more neural networks to generate a three-dimensional grid representing the inserted object 4608 based on visual features of the inserted object 4608, semantic information associated with the inserted object 4608 (e.g., by detecting that the object is a road sign), or other information associated with the inserted object 4608.
In one or more embodiments, scene-based image editing system 106 utilizes a three-dimensional grid corresponding to insertion object 4608 to generate shadow 4610 within modified two-dimensional image 4600 c. In particular, scene-based image editing system 106 generates an object shadow map corresponding to insertion object 4608 (e.g., based on the corresponding three-dimensional mesh and lighting parameters). To illustrate, as previously described, the scene-based image editing system 106 generates an object shadow map for rendering shadows cast by the intervening object 4608 onto one or more other objects (i.e., such that the shadows follow the three-dimensional contours of the other objects depicted in the two-dimensional scene, as shown by the shadows on the automobile of fig. 46C) from one or more light sources in the modified two-dimensional image 4600C.
The scene-based image editing system 106 also applies shadow maps when rendering the modified two-dimensional image 4600 c. In particular, the scene-based image editing system 106 merges an object shadow map based on the inserted object 4608 with one or more additional shadow maps for the modified two-dimensional image 4600 c. For example, the scene-based image editing system 106 determines that the modified two-dimensional image 4600c includes an insertion object 4608 and a foreground object 4602a (e.g., in an initial position). The scene-based image editing system 106 combines the object shadow map corresponding to the inserted object 4608, the proxy shadow map corresponding to the shadow proxy of the foreground object 4602a, and the estimated shadow map of the background of the modified two-dimensional image 4600 c.
To illustrate, the scene-based image editing system 106 merges the shadow maps based on the inserted object 4608, the foreground object 4602a, and the background of the modified two-dimensional image 4600c to generate the shadow 4610. For example, the scene-based image editing system 106 determines that the inserted object 4608 is at least partially located between the foreground object 4602a and the light source extracted from the modified two-dimensional image 4600c. In addition, the scene-based image editing system 106 generates one or more shadows in the modified two-dimensional image 4600c by merging the shadow maps based on the relative three-dimensional positions of the objects and the characteristics of each object type. Thus, the scene-based image editing system 106 merges the shadow maps to cast the shadow 4610 of the inserted object 4608 onto at least a portion of the foreground object 4602a and a portion of the background.
In one or more embodiments, the scene-based image editing system 106 updates shadows of the two-dimensional image in real-time as the user edits the image. For example, the scene-based image editing system 106 determines one or more requests to insert, move, or delete one or more objects in the digital image based on input via the client device. In response to such a request, scene-based image editing system 106 generates an updated shadow map for each object type in the digital image. Further, the scene-based image editing system 106 renders the modified digital image (or a preview of the modified digital image) within the graphical user interface of the client device based on the updated location of the object and the combined shadow map.
In one or more additional embodiments, the scene-based image editing system 106 provides realistic shadow generation in a two-dimensional image through shadow mapping for multiple image editing operations. For example, while the above embodiments describe editing a two-dimensional image by placing objects at specific locations within the two-dimensional image, the scene-based image editing system 106 also generates and incorporates multiple shadow maps in response to modifications to objects in the two-dimensional image. To illustrate, in response to a request to change the shape, size, or orientation of an object within a two-dimensional image, the scene-based image editing system 106 updates three-dimensional characteristics (e.g., a proxy three-dimensional grid) for the object based on the request. Further, the scene-based image editing system 106 generates an updated shadow map (e.g., an updated proxy shadow map) based on the updated three-dimensional characteristics and re-renders the two-dimensional image from the updated shadow map.
As described above, the scene-based image editing system 106 generates shadow proxies for objects within the two-dimensional image for editing the two-dimensional image. Fig. 47A-47B illustrate an example three-dimensional grid of a scene corresponding to a two-dimensional image. Specifically, the three-dimensional mesh of fig. 47A-47B corresponds to the scene of the two-dimensional image of fig. 46A. Fig. 47A shows a three-dimensional grid representing background and foreground objects in a two-dimensional image. FIG. 47B illustrates a shadow proxy for a foreground object.
As shown in fig. 47A, the scene-based image editing system 106 generates one or more three-dimensional grids based on two-dimensional images. Specifically, the scene-based image editing system 106 generates a first three-dimensional grid 4700 representing background content in a scene of a two-dimensional image (e.g., based on estimated depth values from the two-dimensional image). The scene-based image editing system 106 also determines a foreground object (e.g., an automobile) from the two-dimensional image and generates a second three-dimensional grid 4702 for the foreground object. In some embodiments, the scene-based image editing system 106 generates an initial three-dimensional grid that includes all of the content of the two-dimensional image and separates the three-dimensional grid into the first three-dimensional grid 4700 and the second three-dimensional grid 4702. In an alternative embodiment, the scene-based image editing system 106 generates the first three-dimensional grid 4700 and the second three-dimensional grid 4702 separately (e.g., based on a segmentation mask).
In one or more embodiments, the scene-based image editing system 106 generates a second three-dimensional grid 4702 representing the visible portion of the foreground object. For example, the scene-based image editing system 106 generates the second three-dimensional grid 4702 with depth values corresponding to pixels in the two-dimensional image (e.g., by separating the second three-dimensional grid 4702 from the first three-dimensional grid 4700 at the boundary of the foreground object). In addition, as shown in fig. 47A, the second three-dimensional mesh 4702 lacks detail since only a portion of the foreground object is visible in the two-dimensional image.
Further, in one or more embodiments, scene-based image editing system 106 utilizes one or more repair models to repair a portion of a two-dimensional image and a corresponding portion of a three-dimensional grid. As shown in fig. 47A, the scene-based image editing system 106 repairs portions of the first three-dimensional grid 4700 corresponding to a portion of the background behind the foreground object. For example, the scene-based image editing system 106 inserts a plurality of vertices into the portion of the first three-dimensional mesh 4700 based on the three-dimensional locations of vertices in adjacent regions of the first three-dimensional mesh 4700 (e.g., using a smoothing model) to provide a consistent three-dimensional depth for the portion. Further, the scene-based image editing system 106 generates textures to apply to the portion of the first three-dimensional grid 4700 using the additional repair model.
In accordance with one or more embodiments, scene-based image editing system 106 generates shadow proxies for foreground objects. In particular, FIG. 47B illustrates the scene-based image editing system 106 generating a proxy three-dimensional grid 4704 to represent foreground objects. For example, the scene-based image editing system 106 generates the proxy three-dimensional grid 4704 based on the foreground object according to the three-dimensional position of the second three-dimensional grid 4702 in three-dimensional space. To illustrate, the scene-based image editing system 106 generates the proxy three-dimensional grid 4704 to estimate the shape of visible and invisible portions of foreground objects in three-dimensional space in order to create shadows when rendering a two-dimensional image.
In some embodiments, the scene-based image editing system 106 generates the proxy three-dimensional grid 4704 by determining a symmetry axis or plane of the foreground object. For example, the scene-based image editing system 106 processes features of the visible portion of the foreground object to detect an axis of symmetry (e.g., based on repeated features). To illustrate, the scene-based image editing system 106 utilizes image processing techniques and/or object detection techniques to determine the symmetry axis of the automobile in the two-dimensional image based on the visibility of the two tail lights of the automobile in the two-dimensional image. Thus, the scene-based image editing system 106 determines that the foreground object has a plane of symmetry at approximately the halfway mark of the trunk, with the plane of symmetry intersecting the middle of the car.
In response to detecting the symmetry axis of the foreground object, the scene-based image editing system 106 generates the proxy three-dimensional mesh 4704 by mirroring the second three-dimensional mesh 4702 and stitching the mirrored portion to the second three-dimensional mesh 4702. In one or more embodiments, the scene-based image editing system 106 replicates a plurality of vertices in the second three-dimensional grid 4702. In addition, the scene-based image editing system 106 mirrors the positions of the replicated vertices and translates the replicated vertices according to the symmetry axis. The scene-based image editing system 106 also connects the original vertices to the replicated vertices across the symmetry axis to create a complete three-dimensional grid. In an alternative embodiment, the scene-based image editing system 106 mirrors the vertices of the existing mesh to the other side of the symmetry axis and uses a sphere object to shrink-wrap the vertices of the sphere onto the surface of the geometry that includes the mirrored portion.
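A minimal sketch of the mirroring step is shown below, assuming the visible-portion mesh is given as vertex and face arrays and that the symmetry plane has been estimated as a point and a normal; stitching the seam and the sphere-based shrink-wrapping alternative are omitted, and the function name is illustrative.

```python
import numpy as np

def mirror_mesh_across_plane(vertices, faces, plane_point, plane_normal):
    """Reflect the visible half of an object's mesh across an estimated symmetry plane
    and append the mirrored geometry to approximate the unseen half."""
    n = np.asarray(plane_normal, dtype=np.float64)
    n = n / np.linalg.norm(n)
    signed_dist = (vertices - plane_point) @ n           # distance of each vertex to plane
    mirrored = vertices - 2.0 * signed_dist[:, None] * n[None, :]
    offset = len(vertices)
    mirrored_faces = faces[:, ::-1] + offset              # flip winding to keep normals outward
    proxy_vertices = np.vstack([vertices, mirrored])
    proxy_faces = np.vstack([faces, mirrored_faces])
    return proxy_vertices, proxy_faces
```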
In additional embodiments, the scene-based image editing system 106 generates the proxy three-dimensional mesh 4704 by importing a previously generated three-dimensional mesh representing the shape of the foreground object. In particular, the scene-based image editing system 106 utilizes object detection to determine that a foreground object belongs to a particular object class associated with a previously generated three-dimensional grid. To illustrate, the scene-based image editing system 106 determines that the foreground object includes a particular brand and/or model of automobile associated with a previously generated three-dimensional grid. The scene-based image editing system 106 accesses a previously generated three-dimensional grid (e.g., from a database) and inserts the three-dimensional grid into a corresponding location in three-dimensional space to generate a proxy three-dimensional grid 4704.
In some embodiments, the scene-based image editing system 106 determines that a particular object in the two-dimensional image includes a person having a particular pose. The scene-based image editing system 106 utilizes a model to generate a three-dimensional grid of a human having the particular pose. Thus, the scene-based image editing system 106 generates a proxy three-dimensional mesh based on a posed human model generated via one or more neural networks that extract shape and pose information from the two-dimensional image (e.g., by regressing the pose of the detected human shape).
Although FIG. 47B illustrates that the proxy three-dimensional grid 4704 is visible in three-dimensional space, in one or more embodiments, the scene-based image editing system 106 hides the proxy three-dimensional grid 4704 within the graphical user interface. For example, the scene-based image editing system 106 uses the proxy three-dimensional grid 4704 to generate shadows in the rendered two-dimensional image without the proxy three-dimensional grid 4704 itself being visible in the two-dimensional image. Furthermore, in some embodiments, the scene-based image editing system 106 utilizes the three-dimensional grid of the foreground object rather than the proxy three-dimensional grid to render shadows cast by one or more other objects onto the foreground object. Thus, the scene-based image editing system 106 may approximate shadows generated by foreground objects via the proxy three-dimensional grid 4704 while providing accurate rendering of shadows on the visible portions of those foreground objects.
Fig. 48A-48B illustrate additional examples of the scene-based image editing system 106 modifying a two-dimensional image via three-dimensional characteristics of a two-dimensional scene. In particular, fig. 48A shows a first two-dimensional image 4800 including a foreground object 4802 (e.g., a person). In one or more embodiments, the scene-based image editing system 106 receives a request to copy the foreground object 4802 from the first two-dimensional image 4800 and insert the foreground object 4802 into the second two-dimensional image.
In accordance with one or more embodiments, in response to the request or in response to inserting the foreground object 4802 into another digital image, the scene-based image editing system 106 generates a proxy three-dimensional grid for the foreground object 4802. Fig. 48B shows a second two-dimensional image 4804 in which a foreground object 4802a is inserted. Specifically, the scene-based image editing system 106 inserts the foreground object 4802a into the second two-dimensional image 4804 at the selected location. Further, the scene-based image editing system 106 inserts a proxy three-dimensional mesh corresponding to the foreground object 4802a at a corresponding location in a three-dimensional space that includes a three-dimensional mesh representing the scene of the second two-dimensional image 4804.
Further, in one or more embodiments, scene-based image editing system 106 generates a plurality of shadow maps based on second two-dimensional image 4804 and a proxy three-dimensional grid corresponding to foreground object 4802 a. Specifically, in response to inserting the proxy three-dimensional mesh of the foreground object 4802a into the three-dimensional space corresponding to the second two-dimensional image 4804, the scene-based image editing system 106 generates a proxy shadow map for the second two-dimensional image 4804. Further, the scene-based image editing system 106 generates an estimated shadow map for the second two-dimensional image 4804 from shadows (e.g., shadows generated by trees and shrubs) already present in the second two-dimensional image 4804.
In conjunction with inserting the foreground object 4802a into the second two-dimensional image 4804, the scene-based image editing system 106 renders the modified two-dimensional image to include shadows 4806 generated by the foreground object 4802 a. Further, as shown, the scene-based image editing system 106 renders the modified two-dimensional image to include the effects of one or more shadows cast by the background object onto the foreground object 4802 a. In particular, scene-based image editing system 106 combines the proxy shadow map and the estimated shadow map to determine whether to place shadows cast by foreground object 4802a and the background and where to place them. As shown in fig. 48B, the foreground object 4802a casts a shadow onto the background, and the background (e.g., a tree) casts at least a portion of the shadow onto the foreground object 4802 a. Moving the object within the second two-dimensional image 4804 causes the scene-based image editing system 106 to update the position and effect of shadows of the foreground object 4802a and/or the background.
FIG. 49 illustrates a graphical user interface for editing various visual characteristics of objects in a two-dimensional image. In particular, fig. 49 shows a client device displaying a two-dimensional image 4900 including an insertion object 4902. For example, the client device displays the object 4902 inserted into the two-dimensional image 4900 at the selected location. In one or more embodiments, the client device provides a tool for modifying the position of object 4902, such as by moving object 4902 along the ground in three-dimensional space. Thus, as object 4902 moves within two-dimensional image 4900, scene-based image editing system 106 also moves the corresponding three-dimensional mesh of object 4902 relative to the three-dimensional mesh representing the scene within the three-dimensional space of two-dimensional image 4900. To illustrate, moving object 4902 forward along the ground in two-dimensional image 4900 causes scene-based image editing system 106 to increase or decrease the size and/or rotation of object 4902 based on the geometry of the ground in three-dimensional space.
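To illustrate the kind of geometry involved when dragging an object along the ground, the sketch below casts a ray through the selected pixel, intersects it with a flat ground plane, and rescales the object in inverse proportion to its new depth. It assumes an untilted pinhole camera with known intrinsics (fx, fy, cx, cy) and a planar ground, which simplifies the system's full three-dimensional ground mesh; the function names are hypothetical.

```python
import numpy as np

def drop_point_on_ground(pixel, intrinsics, camera_height):
    """Cast a ray through the selected pixel and intersect it with the ground plane
    (camera at `camera_height` above the ground, y axis pointing down)."""
    fx, fy, cx, cy = intrinsics
    ray = np.array([(pixel[0] - cx) / fx, (pixel[1] - cy) / fy, 1.0])
    if ray[1] <= 0:                     # ray does not point toward the ground
        return None
    t = camera_height / ray[1]          # valid for an untilted camera
    return t * ray                      # 3D ground point in camera coordinates

def screen_scale_ratio(depth_before, depth_after):
    """On-screen size scales roughly as 1 / depth under a pinhole camera."""
    return depth_before / depth_after
```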
Further, in one or more embodiments, the scene-based image editing system 106 generates the shadow 4904 of the object 4902 based on multiple shadow maps of the two-dimensional image 4900. For example, the scene-based image editing system 106 generates an estimated shadow map based on the background of the two-dimensional image 4900, which includes trees and their corresponding shadows based on the light sources in the background. In addition, the scene-based image editing system 106 generates an object shadow map for the object 4902. The scene-based image editing system 106 generates the shadow 4904 of the object 4902 by merging the shadow maps, with a direction and length based on the light sources and a perspective corresponding to the three-dimensional grid of the object 4902 in three-dimensional space. Further, moving the object 4902 in the two-dimensional image 4900 causes the scene-based image editing system 106 to update the location, direction, and size of the shadow 4904.
In one or more additional embodiments, the scene-based image editing system 106 also provides a tool for editing the three-dimensional characteristics of the object 4902. In particular, fig. 49 shows a set of visual indicators 4906 for rotating the object 4902 in three-dimensional space. More specifically, in response to detecting one or more interactions with the set of visual indicators 4906, the scene-based image editing system 106 modifies an orientation of the three-dimensional grid corresponding to the object 4902 in three-dimensional space and updates the two-dimensional depiction of the object 4902 accordingly. Furthermore, in some embodiments, the scene-based image editing system 106 utilizes the three-dimensional grid of the background to provide realistic modifications to the orientation of the object 4902, such as by constraining certain portions of the object 4902 to remain in contact with the ground (e.g., by ensuring that an animal's feet stay in contact with the ground when rotating the object 4902). Similarly, the scene-based image editing system 106 may utilize such constraints to modify the position of the object 4902 along the ground to maintain consistent contact between a particular portion of the object 4902 and the background according to one or more contours of the background.
The scene based image editing system 106 may utilize these three-dimensional modeling and shadow generation methods in conjunction with the various image editing methods described above. For example, the scene-based image editing system 106 may preprocess the digital image to identify foreground objects, generate a three-dimensional model of the foreground objects, and repair behind the foreground objects. In response to a user selection of an object (e.g., finger press), the image editing system may move the object and generate a dynamic shadow that falls on and twists over the three-dimensional contours of other objects depicted in the scene.
In some embodiments, the scene-based image editing system 106 utilizes three-dimensional characteristics and/or three-dimensional representations of the two-dimensional image to determine depth and/or scale information associated with the content of the two-dimensional image. For example, the scene-based image editing system 106 utilizes three-dimensional characteristics of the detected ground features and camera parameters in the two-dimensional image to estimate pixel and metric scaling (e.g., number of pixels to metric distance/height) corresponding to a particular pixel location in the two-dimensional image. Fig. 50 shows an overview of a scene-based image editing system 106 utilizing three-dimensional characteristics of a two-dimensional image 5000 to generate a scale field that includes scale information for content in the two-dimensional image 5000. In particular, FIG. 50 illustrates a scene-based image editing system 106 utilizing a machine learning model to generate a scale field for performing one or more downstream operations.
In one or more embodiments, the scene-based image editing system 106 generates a scale field 5004 from the two-dimensional image 5000 using a scale field model 5002. In particular, the scene-based image editing system 106 provides a tool for automatically processing the two-dimensional image 5000 to determine scale information for the content of the two-dimensional image 5000. For example, the scene-based image editing system 106 provides tools for editing objects or inserting objects into the two-dimensional image 5000 by scaling the objects based on the content of the two-dimensional image 5000. In another example, the scene-based image editing system 106 provides a tool for determining a metric distance for content depicted in the two-dimensional image 5000.
In accordance with one or more embodiments, the scale field model 5002 includes a machine learning model (e.g., a neural network including one or more neural network layers) for generating the scale field 5004. In particular, the scene-based image editing system 106 utilizes the scale field model 5002 to generate the scale field 5004 representing a scale of metric distances relative to pixel distances. For example, the scene-based image editing system 106 generates the scale field 5004 to represent a ratio of a metric distance in the three-dimensional space of the two-dimensional image 5000 to a corresponding pixel distance in the two-dimensional space. Thus, the scene-based image editing system 106 generates the scale field 5004 to include a plurality of values indicative of pixel-to-metric ratios for a plurality of pixels in the two-dimensional image 5000. To illustrate, for a given pixel, the value in the scale field 5004 represents a ratio between the pixel distance from that point to the horizon in two-dimensional space and the corresponding metric distance in three-dimensional space.
In one or more embodiments, the scene-based image editing system 106 utilizes the scale field model 5002 to generate the scale field 5004 to include values for a subset of pixels of the two-dimensional image 5000. For example, the scale field 5004 includes non-null or non-zero values corresponding to pixels below the horizon of the two-dimensional image 5000. In accordance with one or more embodiments, the scene-based image editing system 106 generates the scale field 5004 as a matrix of values corresponding to the matrix of pixels in the two-dimensional image 5000. In some examples, the scene-based image editing system 106 generates the scale field 5004 for storage in memory while editing the two-dimensional image 5000, or as a separate file (or metadata) corresponding to the two-dimensional image 5000 for various downstream operations.
As shown in fig. 50, the scene-based image editing system 106 utilizes the scale field 5004 to perform one or more additional downstream operations associated with the two-dimensional image 5000. For example, scene-based image editing system 106 utilizes scale field 5004 to generate modified two-dimensional image 5006 based on two-dimensional image 5000. To illustrate, scene-based image editing system 106 generates modified two-dimensional image 5006 by inserting objects into particular locations in two-dimensional image 5000 according to scale field 5004. In some embodiments, scene-based image editing system 106 generates modified two-dimensional image 5006 by moving objects in two-dimensional image 5000 from one location to another location according to scale field 5004.
In additional embodiments, the scene-based image editing system 106 utilizes the scale field 5004 to determine a metric distance in the two-dimensional image 5000. For example, the scene-based image editing system 106 utilizes the scale field 5004 to generate a distance 5008 associated with the content of the two-dimensional image 5000. To illustrate, the scene-based image editing system 106 utilizes the scale field 5004 to determine a size, length, width, or depth of content within the two-dimensional image 5000 based on one or more values of the scale field 5004 for one or more pixels of the two-dimensional image 5000. Thus, the scene-based image editing system 106 utilizes the scale field 5004 to provide information associated with metric distance measurements corresponding to content in the two-dimensional image 5000 based on pixel distances of the content depicted in the two-dimensional image 5000.
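As a hypothetical usage sketch, the helper below estimates the metric height of an upright object from its pixel extent, assuming the scale field stores, per ground pixel, the normalized pixel height of the ground-to-horizon vector divided by the camera height (as defined later in this description) and that the object's base rests on the ground.

```python
def metric_height_from_scale_field(scale_field, base_pixel, top_pixel, image_height):
    """Estimate an upright object's metric height using the scale field value at the
    pixel where the object touches the ground."""
    x, y_base = base_pixel
    _, y_top = top_pixel
    pixel_height = abs(y_base - y_top) / image_height   # normalized pixel height
    pixels_per_meter = scale_field[y_base, x]            # local pixel-to-metric ratio
    return pixel_height / pixels_per_meter
```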
In accordance with one or more embodiments, by utilizing the scale field model of FIG. 50 to generate a scale field of a two-dimensional image, the scene-based image editing system 106 provides more accurate scale-aware information for digital images than conventional systems. In particular, in contrast to conventional systems that utilize camera internal/external parameters to transition between two-dimensional and three-dimensional measurements, the scene-based image editing system 106 provides accurate scaling based on translations between two-dimensional and three-dimensional spaces directly from two-dimensional images. Furthermore, in contrast to conventional systems that utilize single-view metrology to establish relationships between low-level image features such as vanishing points and vanishing lines, the scene-based image editing system 106 trains the scale field model using annotated scale fields on digital images to more accurately determine the scaling of objects using the scale fields.
In one or more embodiments, scene-based image editing system 106 utilizes a scale field model to generate a scale field for analyzing and/or modifying a digital image. In particular, as described above, scene-based image editing system 106 utilizes a scale field to provide accurate scale-aware processing of two-dimensional image content. For example, the scene-based image editing system 106 generates a scale field representing a conversion of two-dimensional measurements in pixels to three-dimensional measurements in a corresponding three-dimensional space. To illustrate, the scene-based image editing system 106 generates a scale field to provide such scale-aware data by utilizing the estimated distance of the two-dimensional image to the camera and the estimated parameters of the camera during training of the scale field model.
According to one or more embodiments, at each ground pixel of a two-dimensional image, the pixel height grows highly linearly (or approximately linearly) with the corresponding three-dimensional metric height according to the perspective camera model. By locally defining the ratio between pixel heights and their corresponding metric heights, the scene-based image editing system 106 generates a scale field. Thus, the scene-based image editing system 106 provides a dense, localized, non-parametric representation of the scale of the scene in a two-dimensional image.
As described above, in one or more embodiments, scene-based image editing system 106 generates a scale field of a two-dimensional image using a machine learning model. FIG. 51 illustrates an embodiment of a scale field model used by scene-based image editing system 106 to determine the scale of content in two-dimensional image 5100. In particular, FIG. 51 illustrates an embodiment of a scale field model that includes multiple branches for generating multiple types of data associated with a scale of content in a two-dimensional image 5100.
As shown in fig. 51, the scale field model includes a plurality of neural network layers in a codec architecture. In particular, the scale field model includes an encoder 5102 that encodes features of a two-dimensional image 5100. In one or more embodiments, the encoder 5102 includes a transformer-based feature extractor for extracting features from the two-dimensional image 5100. For example, the scene-based image editing system 106 utilizes the encoder 5102 of the scale field model to generate a feature representation of the two-dimensional image 5100 based on the extracted features of the two-dimensional image 5100.
In response to generating the feature representation from the two-dimensional image 5100, the scene-based image editing system 106 utilizes the scale field model of FIG. 51 to generate a plurality of different outputs. More specifically, the scale field model includes a scale field decoder 5104 ("SF decoder") to generate a scale field 5106 based on the feature representation. To illustrate, the scale field decoder 5104 processes the feature representation generated by the encoder 5102 to generate the scale field 5106 in a first branch of the scale field model. In one or more embodiments, the scale field model includes the scale field decoder 5104 to generate the scale field 5106 at the same resolution as the two-dimensional image 5100. For example, the encoder 5102 generates a feature representation at a downsampled resolution from the two-dimensional image 5100, and the scale field decoder 5104 decodes and upsamples the feature representation to a higher resolution (e.g., the resolution of the two-dimensional image 5100).
According to one or more embodiments, the scale field model further comprises additional branches. Specifically, as shown in fig. 51, the scale field model includes a neural network branch with a ground-to-horizon decoder 5108 ("G2H decoder"). In particular, the ground-to-horizon decoder 5108 decodes the feature representation to generate a plurality of ground-to-horizon vectors 5110. In one or more embodiments, the ground-to-horizon vector includes a vector indicating a direction and distance from a particular point in three-dimensional space to the horizon. For example, the ground-to-horizon decoder 5108 generates ground-to-horizon vectors 5110 for a plurality of ground points depicted in the two-dimensional image 5100 based on projections of content of the two-dimensional image 5100 projected into a three-dimensional space to indicate a vertical distance from the ground points to the horizon.
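A skeletal PyTorch version of this two-branch design is sketched below. The convolutional backbone, layer sizes, and class name are placeholders standing in for the transformer-based encoder and the decoders described above; the sketch only shows how a shared feature representation feeds two decoder heads that output a one-channel scale field and a two-channel ground-to-horizon vector field at the input resolution.

```python
import torch
import torch.nn as nn

class TwoBranchScaleModel(nn.Module):
    """Illustrative shared encoder with a scale-field head and a ground-to-horizon head."""

    def __init__(self, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )

        def make_decoder(out_channels):
            return nn.Sequential(
                nn.ConvTranspose2d(feat_dim, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(feat_dim, out_channels, 4, stride=2, padding=1),
            )

        self.scale_field_decoder = make_decoder(1)   # pixel-to-metric ratio per pixel
        self.g2h_decoder = make_decoder(2)           # 2D ground-to-horizon vector per pixel

    def forward(self, image):
        features = self.encoder(image)               # downsampled shared representation
        return self.scale_field_decoder(features), self.g2h_decoder(features)
```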
As shown in fig. 51, scene-based image editing system 106 generates a scale field 5106 and a ground-to-horizon vector 5110 from a two-dimensional image 5100 using a scale field model. Although fig. 51 shows a scale field model generating scale field 5106 and ground-to-horizon vector 5110, in an alternative embodiment, scene-based image editing system 106 utilizes a scale field model that generates only scale fields of a two-dimensional image. For example, the scene-based image editing system 106 utilizes a single-branch neural network to generate a scale field based on a two-dimensional image.
In accordance with one or more embodiments, the scene-based image editing system 106 performs one or more downstream operations using the scale field 5106 and the ground-to-horizon vectors 5110. For example, the scene-based image editing system 106 utilizes the scale field 5106 to measure metric distances of content depicted in a digital image or to perform other scale-aware image compositing operations. In another example, the scene-based image editing system 106 utilizes the scale field 5106 to insert objects or move objects within a digital image, including uses in architectural or furniture applications. In addition, the scene-based image editing system 106 utilizes the ground-to-horizon vectors 5110 to determine the angle of placement (e.g., rotation/direction) of objects within the digital image.
In some embodiments, scene-based image editing system 106 trains one or more neural networks to generate scale-aware information associated with the two-dimensional image. For example, the scene-based image editing system 106 generates training data comprising a plurality of annotated two-dimensional images for learning parameters of the scale field model of FIG. 51. In particular, as described in more detail below with reference to FIG. 54, the scene-based image editing system 106 generates a dataset of annotated two-dimensional images that includes scale information for training a scale field model that automatically generates a scale field based on content in the two-dimensional images.
In conjunction with generating the annotated two-dimensional image, scene-based image editing system 106 determines scale information associated with the two-dimensional image. For example, fig. 52 shows a representation of a two-dimensional image 5200 in which the two-dimensional characteristics of the content of the two-dimensional image 5200 are projected onto the three-dimensional characteristics. In one or more embodiments, as shown, transforming the two-dimensional characteristic into the three-dimensional characteristic optionally involves determining an estimated depth value 5202 from the two-dimensional image 5200. To illustrate, as previously described, the scene-based image editing system 106 utilizes a depth estimation neural network to estimate depth values for pixels of a two-dimensional image.
As shown in fig. 52, the scene-based image editing system 106 determines a three-dimensional space 5204 corresponding to the content of the two-dimensional image 5200 by projecting the content of the two-dimensional image 5200 into the three-dimensional space 5204. For example, the scene-based image editing system 106 determines the three-dimensional space 5204 based on the estimated depth value 5202. To illustrate, the scene-based image editing system 106 converts the estimated depth values 5202 into three-dimensional points in the three-dimensional space 5204 by generating a model using one or more neural networks, such as an adaptive mosaic model or other three-dimensional mesh.
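The projection of estimated depth values into three-dimensional points can be sketched as a standard pinhole unprojection, assuming known (or estimated) camera intrinsics; this is a simplified stand-in for the mesh-generation models mentioned above.

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """Lift each pixel's estimated depth to a 3D point in camera space; the resulting
    point cloud can then be meshed into a three-dimensional representation."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)   # (H, W, 3) camera-space points
```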
In an additional embodiment, the scene-based image editing system 106 projects the content of the two-dimensional image 5200 into the three-dimensional space 5204 by identifying specific content in the two-dimensional image 5200. Specifically, the scene-based image editing system 106 determines an estimated ground 5206 in the two-dimensional image 5200. For example, scene-based image editing system 106 identifies ground pixels in two-dimensional image 5200 that correspond to the ground.
Further, in one or more embodiments, the scene-based image editing system 106 determines annotations indicative of camera parameters associated with the two-dimensional image 5200. In particular, the scene-based image editing system 106 determines a camera height 5208 corresponding to a camera position of a camera capturing the two-dimensional image 5200. In some embodiments, the scene-based image editing system 106 determines a focal length 5210 and a camera pitch 5212 associated with the camera. For example, the scene-based image editing system 106 determines camera parameters that affect the position, rotation, tilt, focus, etc. of content depicted in the two-dimensional image 5200. In additional embodiments, the scene-based image editing system 106 determines annotations indicating a particular height or distance in the two-dimensional image 5200. In accordance with one or more embodiments, the scene-based image editing system 106 utilizes one or more neural networks to determine the camera parameters. Further, in some embodiments, the scene-based image editing system 106 utilizes the camera height 5208 to determine the horizon 5214 of the two-dimensional image 5200.
Alternatively, in one or more embodiments, the scene-based image editing system 106 determines annotations in the two-dimensional image 5200 that indicate certain information that allows the scene-based image editing system 106 to determine the three-dimensional space 5204 and camera parameters in response to user input. For example, the scene-based image editing system 106 (or another system) provides the two-dimensional image 5200 along with a plurality of additional digital images to a plurality of human annotators to annotate the horizon. Thus, the scene-based image editing system 106 obtains annotated digital images and uses one or more image processing models to determine three-dimensional space (e.g., estimated camera height, horizon, and ground) based on annotations of human origin or other sources.
Fig. 53A to 53C show diagrams indicating a relationship between a metric distance and a pixel distance in a two-dimensional image. In addition, fig. 53A to 53C show the influence of camera parameters on the relationship between the metric distance and the pixel distance. Specifically, fig. 53A shows that points in the image plane 5300 are projected to points in the three-dimensional space according to camera parameters of the camera 5302. Fig. 53b shows a graphical representation of various camera parameters of a camera for capturing two-dimensional images. Further, fig. 53C shows a relationship between a metric distance in a three-dimensional space and a pixel distance in a two-dimensional space according to camera parameters and projections of points.
As described above, fig. 53A shows projection from a two-dimensional image to a point in a three-dimensional space. For example, fig. 53A shows projections of a plurality of points corresponding to different pixels on the image plane 5300 of a two-dimensional image onto a plurality of ground points in a three-dimensional space corresponding to the two-dimensional image. To illustrate, based on camera parameters of camera 5302, a first point 5304a in image plane 5300 corresponding to a first pixel height is projected to a first ground point 5306a in three-dimensional space. In addition, based on camera parameters of camera 5302, a second point 5304b in image plane 5300 corresponding to a second pixel height is projected to a second ground point 5306b in three-dimensional space. Fig. 53A also shows that a third point 5304c in the image plane 5300 corresponding to a third pixel height is projected to a third ground point 5306c in three-dimensional space based on camera parameters of the camera 5302.
In one or more embodiments, as shown in fig. 53A, the horizon 5308 in three-dimensional space corresponds to the camera height 5310 of the camera 5302. Specifically, the horizon 5308 corresponds to a visual boundary separating the ground from the sky in the two-dimensional image. Furthermore, in one or more embodiments, horizon 5308 is equal to camera height 5310. Thus, when projecting the content of the two-dimensional image into the three-dimensional space, the camera height 5310 of the determination camera 5302 also indicates the horizon 5308 in the three-dimensional space.
In one or more additional embodiments, the two-dimensional image includes ground pixels 5312 that correspond to a ground 5314 projected into three-dimensional space as a single plane from which the camera height 5310 is determined. Specifically, the ground pixel 5312 includes a pixel below a point in the image plane 5300 corresponding to the horizon 5308 in three-dimensional space. Thus, projecting pixels below the point in the image plane 5300 corresponding to the horizon 5308 into three-dimensional space results in projecting pixels to ground points on the ground 5314. Furthermore, projecting pixels above the horizon into three-dimensional space does not result in projecting pixels to ground points on the ground 5314.
Fig. 53A also shows that by extending the horizon 5308 in three dimensions, the horizon 5308 is at an equidistant position relative to the ground 5314. Specifically, fig. 53A shows a first ground-to-horizon vector 5316a representing a first distance from a first ground point 5306a to horizon 5308, a second ground-to-horizon vector 5316b representing a second distance from a second ground point 5306b to horizon 5308, and a third ground-to-horizon vector 5316c representing a third distance from a third ground point 5306c to horizon 5308. More specifically, the first ground-to-horizon vector 5316a, the second ground-to-horizon vector 5316b, and the third ground-to-horizon vector 5316c indicate the same distance (e.g., the same metric height) from the ground 5314 to the horizon 5308.
Fig. 53B illustrates various camera parameters of the camera 5302 of fig. 53A. Specifically, fig. 53B shows a camera pitch 5318 indicating an angle θ of the camera 5302 with respect to the horizon 5308. For example, a camera pitch of 0 degrees indicates that the camera 5302 is pointing in a horizontal direction, and a camera pitch of 45 degrees indicates that the camera 5302 is pointing in a downward direction midway between horizontal and vertical. Further, fig. 53B shows a focal length 5320 of the camera 5302, representing a distance between a center of a lens of the camera 5302 and a focal point (e.g., a point on the image plane 5300).
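Under a pinhole model with these parameters, the image row of the horizon can be approximated directly from the camera pitch and focal length, assuming no camera roll; the helper below is a hypothetical illustration of that relationship rather than the system's estimation model.

```python
import math

def horizon_row(pitch_radians, focal_length_px, principal_point_y):
    """Ground points at infinite distance project f * tan(pitch) above the principal
    point for a camera pitched downward by `pitch_radians` (image y grows downward)."""
    return principal_point_y - focal_length_px * math.tan(pitch_radians)
```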
Fig. 53C shows a relationship between a three-dimensional space of a two-dimensional image according to camera parameters of the camera 5302 and distances in the two-dimensional space. For example, fig. 53C shows a ground-to-horizon vector 5322 corresponding to the distance between a point on the ground 5314 and the horizon 5308. Further, fig. 53C shows a ground-to-point vector 5324 corresponding to a distance between a point on the ground 5314 and a specific point in three-dimensional space. To illustrate, the ground-to-point vector 5324 indicates the height of an object located at a point on the ground 5314.
In one or more embodiments, as shown in fig. 53C, the ground-to-horizon vector 5322 in three-dimensional space corresponds to a first pixel distance 5326 on the image plane 5300 (e.g., in two-dimensional space). Further, the ground-to-point vector 5324 in three-dimensional space corresponds to a second pixel distance 5328 on the image plane 5300. As shown, the difference between the first pixel distance 5326 and the second pixel distance 5328 corresponds to the difference between the ground-to-horizon vector 5322 and the ground-to-point vector 5324. To illustrate, the difference in the two-dimensional space has a linear relationship (or approximately a linear relationship) with respect to the difference in the three-dimensional space. Thus, changing the ground-to-point vector 5324 results in an approximately linear change in the second pixel distance 5328.
In one or more embodiments, the scene-based image editing system 106 determines a linear relationship between pixel distances and metric distances in three-dimensional space based on a ratio formula. For example, the scene-based image editing system 106 determines a relationship between pixel heights and metric heights at a fixed ground point. More specifically, the scene-based image editing system 106 determines a camera height $h_{cam}$, a camera pitch $\theta$, a focal length $f$, and a z-axis distance $d$ from the camera to the two vectors. For example, a first vector at the fixed ground point has a first metric height $h_1$, and a second vector at the fixed ground point has a second metric height $h_2$. The pixel distances of the corresponding line segments on the image plane include a first pixel distance $ph_1$ corresponding to the first vector and a second pixel distance $ph_2$ corresponding to the second vector. Further, the scene-based image editing system 106 determines the following pixel distances under the perspective camera model:

$$ph_i = \frac{f\, h_i\, (d - h_{cam}\sin\theta)}{d\,\cos\theta\,(d - h_i\sin\theta)}, \qquad i \in \{1, 2\}$$

The scene-based image editing system 106 modifies the pixel distances described above by generating an approximation of the linear relationship as follows:

$$ph_i \approx \frac{f\,(d - h_{cam}\sin\theta)}{d^{2}\cos\theta}\, h_i, \qquad \text{so that} \qquad \frac{ph_1}{ph_2} \approx \frac{h_1}{h_2}$$
Based on the above-described determination indicating a linear relationship between metric distances and pixel distances, the scene-based image editing system 106 determines a two-dimensional vector field in a two-dimensional image in which a plurality of vectors begin at ground pixels and end at their intersections with the horizon. In particular, the scene-based image editing system 106 determines a plurality of ground-to-horizon vectors that are perpendicular to the ground plane when projected into three-dimensional space. Further, as previously described, the ground-to-horizon vectors have the same metric distance corresponding to the camera height. The ground-to-horizon vectors also have a linear relationship between pixel and metric distances. Thus, the scene-based image editing system 106 defines the scale field SF by dividing the pixel magnitude of the ground-to-horizon vector by the absolute metric height of the camera:

$$SF(x, y) = \frac{ph(x, y)}{h_{cam}}$$

where (x, y) is a two-dimensional coordinate and ph is the pixel height of the ground-to-horizon vector from (x, y), normalized by the image height and width. The generated scale field is a two-dimensional map of per-pixel values indicative of the pixel-to-metric ratio.
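A simple sketch of constructing such a scale field from an annotated horizon row and a known camera height is shown below. It assumes a level (roll-free) horizon, so the ground-to-horizon vectors are vertical in the image, and it normalizes the pixel height by the image height only; both are simplifying assumptions for illustration.

```python
import numpy as np

def compute_scale_field(image_height, image_width, horizon_row, camera_height_m):
    """Per-pixel scale field: for ground pixels, the normalized vertical pixel distance
    to the horizon divided by the metric camera height; zero at and above the horizon."""
    rows = np.arange(image_height, dtype=np.float32)[:, None]
    pixel_height_to_horizon = (rows - horizon_row) / image_height   # normalized
    column = np.where(pixel_height_to_horizon > 0,
                      pixel_height_to_horizon / camera_height_m,
                      0.0)
    return np.broadcast_to(column, (image_height, image_width)).copy()
```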
In one or more embodiments, the scene-based image editing system 106 determines a scale field that provides, for each ground pixel in the two-dimensional image, information indicating how many pixels represent a certain amount of vertical metric length in the projected three-dimensional space. By generating a scale field, the scene-based image editing system 106 enables various scale-aware operations on a two-dimensional image, such as three-dimensional understanding or scale-aware image editing. Furthermore, in some embodiments, the scene-based image editing system 106 utilizes the scale field of the two-dimensional image to further improve the performance of the neural network used to determine the depth estimate of the two-dimensional image. Furthermore, while the above examples describe determining pixel-to-metric heights in a two-dimensional image, the scene-based image editing system 106 may also utilize the scale field to determine other pixel-to-metric distances based on the pixel-to-metric heights. In an alternative embodiment, the scene-based image editing system 106 generates a scale field that represents the metric-to-pixel height rather than the pixel-to-metric height.
In accordance with one or more embodiments, the scene-based image editing system 106 utilizes annotated two-dimensional images to train one or more neural networks in connection with generating scale-aware data from the two-dimensional images. For example, as shown in fig. 54, the scene-based image editing system 106 trains the machine learning model 5402 with a dataset comprising a two-dimensional image 5400. In particular, in one or more embodiments, the scene-based image editing system 106 utilizes a training data set comprising two-dimensional images 5400 to modify parameters of the machine learning model 5402 to generate scale field data and/or additional scale awareness information (e.g., ground-to-horizon vectors).
In one or more embodiments, scene-based image editing system 106 generates a training dataset comprising two-dimensional image 5400 by annotating two-dimensional image 5400 with scale awareness information. For example, the scene-based image editing system 106 automatically annotates (e.g., via one or more additional machine learning models) one or more portions of the two-dimensional image 5400. In some embodiments, scene-based image editing system 106 annotates one or more portions of two-dimensional image 5400 based on user input (e.g., via one or more human annotations). Further, in some embodiments, scene-based image editing system 106 trains machine learning model 5402 with various scene types, image types, and/or camera parameters.
In at least some embodiments, the scene-based image editing system 106 determines the annotations 5404 of the two-dimensional images 5400 based on camera parameters. To illustrate, the scene-based image editing system 106 determines scale information for a dataset of web images or other two-dimensional images based on internal and external camera parameters. In additional embodiments, the scene-based image editing system 106 determines annotations based on additional sensor systems and/or metadata associated with the two-dimensional images 5400. Thus, the scene-based image editing system 106 utilizes one or more types of sources to annotate the two-dimensional images 5400 with field of view, pitch, roll, and/or camera height parameters. Further, the scene-based image editing system 106 utilizes the camera parameters to determine the horizon and ground-to-horizon vectors for the two-dimensional images 5400. Figs. 55A-55D and 56A-56E and the corresponding descriptions provide additional details regarding annotating two-dimensional images.
In conjunction with generating a dataset comprising a two-dimensional image 5400 of multiple annotations 5404, the scene-based image editing system 106 utilizes the machine learning model 5402 to generate predicted scale information. For example, as shown in fig. 54, the scene-based image editing system 106 utilizes a first machine learning model to generate a predicted scale field 5406 of the two-dimensional image 5400. Further, as shown in fig. 54, the scene-based image editing system 106 utilizes a second machine learning model to generate a predicted ground-to-horizon vector 5408 for the two-dimensional image 5400.
In response to generating the predicted scale field 5406 and/or the predicted ground-to-horizon vector 5408, the scene-based image editing system 106 utilizes the annotations 5404 to determine one or more losses. In particular, the scene-based image editing system 106 determines a reference truth scale field 5410 and a reference truth ground-to-horizon vector 5412 from the annotations 5404. For example, the scene-based image editing system 106 compares the predicted scale field 5406 with the reference truth scale field 5410 to determine a first loss 5414. Further, the scene-based image editing system 106 compares the predicted ground-to-horizon vector 5408 with the reference truth ground-to-horizon vector 5412 to determine a second loss 5416.
In one or more embodiments, the scene-based image editing system 106 utilizes the first loss 5414 and/or the second loss 5416 to modify parameters of the one or more machine learning models 5402. For example, the scene-based image editing system 106 modifies parameters of the first machine learning model that generates the scale field based on the first penalty 5414. Further, the scene-based image editing system 106 modifies parameters of a second machine learning model that generates a ground-to-horizon vector based on the second penalty 5416. In some embodiments, the scene-based image editing system 106 utilizes a single model (e.g., a multi-branch model) to generate the predicted scale field 5406 and the predicted ground-to-horizon vector 5408. Thus, scene-based image editing system 106 utilizes first loss 5414 and second loss 5416 to modify parameters of a single model (e.g., by modifying parameters of separate branches).
In some embodiments, the scene-based image editing system 106 utilizes regression losses (e.g., mean squared error losses with equal loss weights) to determine the first loss 5414 and the second loss 5416. For example, the scene-based image editing system 106 determines the losses by normalizing the predicted scale field 5406, the predicted ground-to-horizon vector 5408, the reference truth scale field 5410, and the reference truth ground-to-horizon vector 5412. To illustrate, the scene-based image editing system 106 normalizes the data according to the respective mean and variance values. More specifically, the scene-based image editing system 106 determines the output of a fully connected layer having multiple channels, which is passed through a softmax and weighted-summed with predefined bin values. In accordance with one or more embodiments, the scene-based image editing system 106 determines bin ranges and distributions for global parameter estimation according to Table 1, wherein U and N denote the uniform distribution and the normal distribution, respectively. Further, the horizon offset is the vertical distance of the horizon from the center of the image, with the upper left corner set as the origin.
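A minimal sketch of the normalized regression losses might look like the following, assuming the per-dataset mean and standard deviation of the scale field and ground-to-horizon targets have been precomputed; the equal weighting follows the description above, while the interface and names are hypothetical.

```python
import torch
import torch.nn.functional as F

def scale_field_losses(pred_sf, pred_g2h, gt_sf, gt_g2h, sf_stats, g2h_stats):
    """Mean squared error on normalized predictions and targets, summed with equal
    weights; `*_stats` are assumed (mean, std) pairs computed over the training set."""
    def normalize(x, stats):
        mean, std = stats
        return (x - mean) / std
    loss_sf = F.mse_loss(normalize(pred_sf, sf_stats), normalize(gt_sf, sf_stats))
    loss_g2h = F.mse_loss(normalize(pred_g2h, g2h_stats), normalize(gt_g2h, g2h_stats))
    return loss_sf + loss_g2h
```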
Figs. 55A-55D illustrate two-dimensional images including a plurality of different annotations associated with generating a scale field for the two-dimensional images. Specifically, Fig. 55A shows an unannotated two-dimensional image 5500a. As shown, the unannotated two-dimensional image 5500a includes a scene captured by a camera having a known (or estimated) camera height. Fig. 55B shows a first annotated two-dimensional image 5500b including a horizon 5502 corresponding to the camera height. Fig. 55C illustrates a second annotated two-dimensional image 5500c having a plurality of ground-to-horizon vectors (e.g., ground-to-horizon vector 5504) extending from a plurality of ground points to the horizon in the corresponding three-dimensional space.
Fig. 55D shows a two-dimensional image 5506 including a scale field overlay. Specifically, as shown, the scale field includes a plurality of values for a plurality of pixels below the horizon. For example, the two-dimensional image 5506 includes a scale field overlay including a color value representing each pixel in the area below the horizon. As shown, each value represents a pixel-to-metric ratio corresponding to the parameters of the two-dimensional image. More specifically, as shown in Fig. 55D, the values of the scale field are lower near the horizon (e.g., indicating a lower pixel-to-metric ratio) and higher farther from the horizon (and closer to the camera position). Thus, the ratio of the pixel distance from each pixel to the horizon (in pixels) relative to the metric distance from the corresponding ground point to the horizon (in three-dimensional space) is lowest near the horizon.
As described above, in one or more embodiments, scene-based image editing system 106 utilizes a plurality of different types of digital images to train a machine learning model to determine scene perception data. In particular, the scene-based image editing system 106 utilizes the two-dimensional panoramic image to generate a training dataset. For example, the scene-based image editing system 106 utilizes panoramic images to extract a plurality of different images for scaling the training dataset. In some embodiments, the panoramic image provides different combinations of camera parameters while maintaining the same camera height. Fig. 56A to 56D illustrate panoramic images and a plurality of images extracted from the panoramic images.
For example, fig. 56A shows a panoramic image 5600 that includes a 360 degree view of space. Further, as shown, the scene-based image editing system 106 determines a plurality of separate two-dimensional images for the training dataset based on the panoramic image 5600. To illustrate, the scene-based image editing system 106 determines a first portion 5602, a second portion 5604, a third portion 5606, and a fourth portion 5608 of the panoramic image 5600. Each image extracted from panoramic image 5600 includes different camera parameters (e.g., pitch, roll) having the same camera height. Each image also includes a different view of the content within the space.
Fig. 56B shows a first image 5610a corresponding to a first portion 5602 of the panoramic image 5600. Fig. 56B also shows an overlaid first image 5610b comprising the scale field and the ground-to-horizon vector overlaid on top of the first image 5610a.
Fig. 56C shows a second image 5612a corresponding to a second portion 5604 of the panoramic image 5600. Fig. 56C also shows a second overlaid image 5612b that includes the scale field and the ground-to-horizon vector overlaid on top of the second image 5612a.
Fig. 56D shows a third image 5614a corresponding to a third portion 5606 of the panoramic image 5600. Fig. 56D also shows a third overlaid image 5614b that includes the scale field and the ground-to-horizon vector overlaid on top of the third image 5614a.
Fig. 56E shows a fourth image 5616a corresponding to a fourth portion 5608 of the panoramic image 5600. Fig. 56E also shows a fourth overlaid image 5616b that includes the scale field and the ground-to-horizon vector overlaid on top of the fourth image 5616a.
As shown in Figs. 56B-56E, the scene-based image editing system 106 determines an equirectangular-to-perspective crop for each individual portion of the panoramic image 5600. In conjunction with determining the separate crops, the scene-based image editing system 106 determines a scale field for each individual portion of the panoramic image 5600. Thus, from a single panoramic view, the scene-based image editing system 106 generates multiple individual images with different combinations of camera parameters but the same camera height. Furthermore, as shown in the separate images, the scene-based image editing system 106 determines, for each of the different images, a scale field relative to a particular horizon and the corresponding ground-to-horizon vectors. The scene-based image editing system 106 may similarly extract multiple images from panoramas of various indoor and outdoor scenes.
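An equirectangular-to-perspective crop of the kind described above can be sketched as follows; this is a simplified nearest-neighbor version under generic conventions (the panorama as an H x W x 3 array, yaw and pitch as the varied camera parameters), not the exact cropping procedure of the scene-based image editing system:

```python
import numpy as np

def equirect_to_perspective(pano, fov_deg, yaw_deg, pitch_deg, out_h, out_w):
    """Sample a perspective crop from an equirectangular panorama (H, W, 3)."""
    fov = np.radians(fov_deg)
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    pano_h, pano_w = pano.shape[:2]

    # Build a camera ray for every output pixel of the virtual perspective camera.
    focal = 0.5 * out_w / np.tan(0.5 * fov)
    u, v = np.meshgrid(np.arange(out_w) - 0.5 * out_w,
                       np.arange(out_h) - 0.5 * out_h)
    rays = np.stack([u, v, np.full_like(u, focal)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    # Rotate rays by pitch (around the x-axis) then yaw (around the y-axis).
    rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch), np.cos(pitch)]])
    ry = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    rays = rays @ (ry @ rx).T

    # Convert rays to spherical coordinates and then to panorama pixel indices.
    lon = np.arctan2(rays[..., 0], rays[..., 2])
    lat = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))
    px = ((lon / (2 * np.pi) + 0.5) * pano_w).astype(int) % pano_w
    py = np.clip(((lat / np.pi + 0.5) * pano_h).astype(int), 0, pano_h - 1)
    return pano[py, px]
```

Because only yaw, pitch, and field of view change between crops while the camera center stays fixed, every extracted image inherits the same camera height, which is the property exploited for the training dataset.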
In one or more embodiments, the scene-based image editing system 106 generates a training dataset comprising digital images having non-horizontal horizons. In particular, the scene-based image editing system 106 utilizes camera parameters in conjunction with the horizon to determine whether a digital image is tilted due to camera roll and/or pitch. The scene-based image editing system 106 may utilize this information in training a machine learning model to account for such camera roll and/or pitch in the processed digital image.
Fig. 57 illustrates a graphical user interface of a client device displaying a digital image 5700 that includes scale-aware information based on content of the digital image. In particular, the client device displays a digital image 5700 including a plurality of objects in a scene captured by a camera having particular parameters. In one or more embodiments, scene-based image editing system 106 processes digital image 5700 using one or more neural networks to generate a scale field for digital image 5700. The scene-based image editing system 106 (or another system such as a digital image editing system) utilizes the scale field to perform one or more downstream operations.
For example, as shown, the client device displays the digital image 5700 with scale-aware information overlaid on top of the digital image 5700. To illustrate, the scene-based image editing system 106 utilizes the scale field generated for the digital image 5700 to display, via the client device, a horizon 5702 corresponding to the camera height of the camera that captured the digital image 5700. Further, the scene-based image editing system 106 generates a plurality of measurements based on metric distances extracted from the scale field of the digital image 5700. More specifically, the client device displays the digital image 5700 including a first object 5704 with a height line 5706 indicating a distance from the detected ground point to the top of the first object 5704. The client device also displays a measurement overlay 5708 indicating the metric distance of the height line 5706, which the scene-based image editing system 106 determines by converting the pixel length of the height line 5706 to a metric distance in three-dimensional space using the scale field value at the pixel of the detected ground point.
In additional embodiments, the scene-based image editing system 106 provides a metric distance for additional objects and/or portions of a two-dimensional image. For example, the scene-based image editing system 106 utilizes scale-aware information of the digital image to determine non-perpendicular distances within the two-dimensional image. To illustrate, the scene-based image editing system 106 utilizes a plurality of ground-to-horizon vectors and/or pixel-to-metric values from a scale field of a two-dimensional image to estimate a horizontal distance or other distance corresponding to a line within the two-dimensional image that is not perpendicular to the horizon.
In one or more embodiments, scene-based image editing system 106 measures a metric distance (e.g., height or width) within a two-dimensional image based on a scale field value of a selected pixel in the two-dimensional image. For example, with respect to measuring a distance from a first pixel (e.g., ground point) to a second pixel in a two-dimensional image, such as from a bottom to a top of an object, the scene-based image editing system 106 determines a value in a scale field for the two-dimensional image at the first pixel that indicates a ratio of a pixel height at the first pixel to a camera height. The scene-based image editing system 106 converts the pixel distances associated with the objects in the two-dimensional image to metric distances for the objects using the indicated ratios (e.g., 50 pixels representing 2 meters at the ground point). In some embodiments, scene-based image editing system 106 utilizes the scale field values of more than one pixel to determine a distance from one ground point to another ground point.
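A minimal sketch of this pixel-to-metric conversion, assuming the scale field stores a pixels-per-meter ratio at each ground pixel (the function and variable names are hypothetical):

```python
def pixel_distance_to_metric(scale_field, ground_point, pixel_distance):
    """Convert a pixel distance to a metric distance using the scale field.

    `scale_field[y, x]` is assumed to hold the pixel-to-metric ratio at that
    ground point (pixels per meter), so a measured pixel span divides by it.
    """
    y, x = ground_point
    pixels_per_meter = scale_field[y, x]
    return pixel_distance / pixels_per_meter


# Example: an object spanning 50 pixels above a ground point where the scale
# field indicates 25 pixels per meter measures roughly 2 meters.
# height_m = pixel_distance_to_metric(scale_field, (420, 315), 50)
```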
In accordance with one or more embodiments, the scene-based image editing system 106 also utilizes scale-aware information to modify digital images. For example, Fig. 58 shows a plurality of digital images modified by inserting objects into a scene. Specifically, Fig. 58 shows a first human contour 5802a inserted at a particular location in a reference truth image 5800a at a reference truth height. In addition, in conjunction with determining the reference truth image 5800a, the scene-based image editing system 106 determines camera parameters and a horizon associated with the reference truth image 5800a.
In one or more embodiments, scene-based image editing system 106 inserts an object into a two-dimensional image by determining scale field values that indicate pixels at ground points at the object insertion point. For example, scene-based image editing system 106 determines scale field values for pixels to determine a ratio of pixel distance to metric distance for an insertion point. The scene-based image editing system 106 utilizes knowledge of the distance (e.g., a known height) associated with the inserted object and converts that distance to a pixel distance. Scene-based image editing system 106 scales the object for insertion at the insertion point based on the pixel distance determined based on the scale field value. In addition, the scene-based image editing system 106 modifies the scale of the object in response to changing the position of the object within the image based on one or more additional scale field values of one or more additional pixels in the two-dimensional image.
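Under the same assumption, the insertion scaling reduces to reading the scale field at the insertion point's ground pixel, as in the following sketch:

```python
def scaled_insert_height(scale_field, insertion_point, object_height_m):
    """Estimate how many pixels tall an inserted object should be rendered.

    Assumes `scale_field[y, x]` stores the pixel-to-metric ratio (pixels per
    meter) at the insertion point's ground pixel.
    """
    y, x = insertion_point
    return object_height_m * scale_field[y, x]


# Re-scaling when the object is dragged to a new ground point simply re-reads
# the scale field at the new location:
# new_height_px = scaled_insert_height(scale_field, new_point, 1.7)
```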
Fig. 58 also shows a number of modified digital images that include human contours at the same location, scaled using various models. In particular, the scene-based image editing system 106 utilizes the scale field model described above (e.g., in Fig. 51) to generate a first modified image 5800b, including the human contour 5802b at the location, from the scale field and the ground-to-horizon vector generated for the first modified image 5800b. Fig. 58 also shows a second modified image 5800c, including a human contour 5802c at the location according to the ground-to-horizon vector and camera height estimated for the second modified image 5800c. Fig. 58 further shows a third modified image 5800d, including a human contour 5802d at the location according to a plurality of camera parameters (e.g., horizon offset, field of view, camera roll, and camera height) estimated for the third modified image 5800d. As shown, the scene-based image editing system 106 utilizes the scale field to generate the first modified image 5800b with accurate scaling relative to the reference truth image 5800a, while the other models produce inaccurate scaling.
Table 2 below includes measurements of model scaling performance for a plurality of different models across a plurality of different image datasets. Specifically, Table 2 includes quantitative estimates of model performance (e.g., performance of Model 1, Model 2, and Model 3) for samples from various datasets. Model 1 includes the scale field model used by the scene-based image editing system 106 trained on a panoramic dataset. Model 2 includes the model used to generate the second modified image 5800c of Fig. 58, trained on the panoramic dataset described above. Model 3 includes the model used to generate the third modified image 5800d of Fig. 58, trained on the panoramic dataset described above. Model 1* and Model 2* refer to Model 1 and Model 2 trained on both the panoramic dataset and a web image dataset. Furthermore, Stanford2D3D corresponds to the dataset described in: Iro Armeni, Sasha Sax, Amir R. Zamir, and Silvio Savarese, "Joint 2D-3D-Semantic Data for Indoor Scene Understanding," arXiv:1702.01105 (2017). Matterport3D corresponds to the dataset described in: Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang, "Matterport3D: Learning from RGB-D Data in Indoor Environments," International Conference on 3D Vision (2017).
As shown in the table above, the scene-based image editing system 106 utilizes scale fields to provide higher scaling accuracy than the other models. In particular, the scale metric indicates the performance when global parameter predictions are replaced with the dense field information provided in the scale field. Furthermore, as indicated, training the scale field model used by the scene-based image editing system 106 on both panoramic and web images significantly improves scaling performance on web images, which include a wide range of scenes, without significantly degrading performance on the other datasets, which have a limited range of camera heights in particular indoor scenes.
In one or more embodiments, as described above, the scene-based image editing system 106 generates a three-dimensional representation of a two-dimensional human being detected in a two-dimensional image. In particular, the scene-based image editing system 106 generates a three-dimensional human model representing a two-dimensional human in a two-dimensional image to perform a plurality of downstream operations. For example, FIG. 59 shows an overview of a scene-based image editing system 106 generating a three-dimensional representation based on two-dimensional humans in a two-dimensional image 5900. More specifically, FIG. 59 illustrates a scene-based image editing system 106 utilizing a three-dimensional representation of a two-dimensional human to modify a two-dimensional image based on a modification pose of the three-dimensional representation.
In one or more embodiments, the scene-based image editing system 106 utilizes a plurality of neural networks to generate a three-dimensional human model 5902 corresponding to a two-dimensional human extracted from the two-dimensional image 5900. In particular, scene-based image editing system 106 utilizes a neural network to extract two-dimensional human-based pose and shape information to generate three-dimensional human model 5902 in three-dimensional space. For example, as previously described, scene-based image editing system 106 generates a three-dimensional representation of content within a scene of two-dimensional image 5900. Thus, the scene-based image editing system 106 generates and inserts the three-dimensional human model 5902 into a particular location in three-dimensional space relative to other content of the two-dimensional image 5900.
In one or more embodiments, as shown in FIG. 59, scene-based image editing system 106 determines modified three-dimensional human model 5904 based on three-dimensional human model 5902. For example, scene-based image editing system 106 generates a re-posed three-dimensional human model in response to user input via a graphical user interface that interacts with three-dimensional human model 5902. Thus, the scene-based image editing system 106 generates the modified three-dimensional human model 5904 according to the target pose.
Further, as shown in Fig. 59, the scene-based image editing system 106 utilizes the modified three-dimensional human model 5904 to generate a modified two-dimensional image 5906. In particular, the scene-based image editing system 106 generates the modified two-dimensional image 5906 comprising a modified two-dimensional human based on the modified three-dimensional human model 5904. Thus, the scene-based image editing system 106 provides a tool for re-posing a two-dimensional human in a two-dimensional image based on a corresponding three-dimensional representation in three-dimensional space.
Although FIG. 59 shows scene-based image editing system 106 utilizing three-dimensional human model 5902 by re-pose three-dimensional human model 5902 to generate modified two-dimensional image 5906, in other embodiments, scene-based image editing system 106 utilizes three-dimensional human model 5902 to perform one or more additional downstream operations associated with two-dimensional image 5900. For example, the scene-based image editing system 106 utilizes the three-dimensional human model 5902 to determine interactions between objects of the two-dimensional image 5900 in three-dimensional space. Furthermore, in some embodiments, scene-based image editing system 106 utilizes three-dimensional human model 5902 to generate shadows in three-dimensional space (e.g., as described above with reference to fig. 40).
In accordance with one or more embodiments, the scene-based image editing system 106 provides real-time editing of human poses in a two-dimensional image by utilizing multiple neural networks to generate a three-dimensional representation of a human in the two-dimensional image. In particular, in contrast to conventional systems that re-pose a human in a two-dimensional image based on a human pose in an additional two-dimensional image, the scene-based image editing system 106 dynamically re-poses a human in a two-dimensional image in real time based on user input to the two-dimensional image. More specifically, the scene-based image editing system 106 provides for re-posing a human from a single monocular image.
In addition, the scene-based image editing system 106 provides accurate re-posing of a human in a two-dimensional image by extracting three-dimensional pose and three-dimensional shape information from the two-dimensional human in the two-dimensional image. In contrast to other systems that re-pose a human in a two-dimensional image based on the pose of a different human in an additional two-dimensional image, the scene-based image editing system 106 utilizes a three-dimensional understanding of the human in the two-dimensional image to re-pose the human. In particular, the scene-based image editing system 106 utilizes the three-dimensional representation of the two-dimensional human to provide accurate re-posing of the three-dimensional representation and to reconstruct the two-dimensional human from the re-posed three-dimensional representation. In addition, in contrast to conventional systems, the scene-based image editing system 106 preserves the shape of the human body when re-posing the three-dimensional representation by extracting the shape and pose of the human directly from the two-dimensional image. Furthermore, as described in greater detail below, the scene-based image editing system 106 also provides an improved user interface that reduces interactions and increases efficiency in generating modified digital images, relative to conventional systems that require a large number of user interactions with numerous tools and pixels to generate images with modified poses.
As described above, in one or more embodiments, the scene-based image editing system 106 generates a three-dimensional representation of a two-dimensional human being extracted from a two-dimensional image. FIG. 60 shows a schematic diagram of a scene-based image editing system 106 utilizing multiple neural networks to generate a three-dimensional representation of a human in a two-dimensional image 6000. In particular, the scene-based image editing system 106 determines three-dimensional characteristics of a human being in the two-dimensional image to perform various downstream operations such as, but not limited to, re-pose the human being, generate shadows, and/or determine interactions with other objects in the two-dimensional image. To illustrate, the scene-based image editing system 106 utilizes a plurality of neural networks to extract two-dimensional and three-dimensional characteristics of a human detected in a two-dimensional image to reconstruct a three-dimensional human model based on the human.
In one or more embodiments, two-dimensional image 6000 includes two-dimensional human 6002. For example, two-dimensional image 6000 includes photographs or other images from which scene-based image editing system 106 extracts information associated with one or more persons. To illustrate, the scene-based image editing system 106 utilizes a two-dimensional pose neural network 6004 to extract two-dimensional pose data associated with a two-dimensional human. Specifically, the two-dimensional pose neural network 6004 includes a two-dimensional body tracker that detects/tracks a human in an image and generates two-dimensional pose data 6006 for the two-dimensional human 6002. More specifically, the two-dimensional pose data 6006 includes the pose of the two-dimensional human 6002 within the two-dimensional space (e.g., relative to the x-axis and y-axis) corresponding to the two-dimensional image 6000.
In accordance with one or more embodiments, the scene-based image editing system 106 utilizes a neural network as described below: Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh, "OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields," IEEE TPAMI (2019), which is incorporated herein by reference in its entirety. For example, the two-dimensional pose neural network 6004 utilizes a non-parametric representation ("part affinity fields") to associate particular body parts with individuals detected in the digital image. In other embodiments, the scene-based image editing system 106 utilizes one or more additional neural networks that use human detection and body part/joint detection and/or segmentation to generate the two-dimensional pose data 6006. For example, the scene-based image editing system 106 utilizes a convolutional neural network-based model to detect articulated two-dimensional poses through a grid image feature map.
In one or more additional embodiments, scene-based image editing system 106 utilizes three-dimensional pose/shape neural network 6008 to extract three-dimensional characteristics of two-dimensional human 6002 in two-dimensional image 6000. Specifically, the scene-based image editing system 106 utilizes the three-dimensional pose/shape neural network 6008 to generate three-dimensional pose data 6010 and three-dimensional shape data 6012 based on the two-dimensional human 6002. As described above, the scene-based image editing system 106 determines the three-dimensional pose of the two-dimensional human 6002 in three-dimensional space while maintaining the three-dimensional shape of the two-dimensional human 6002 in three-dimensional space. For example, the scene-based image editing system 106 generates a three-dimensional space that includes a three-dimensional representation of a scene of the two-dimensional image 6000, and generates three-dimensional pose data 6010 and three-dimensional shape data 6012 of the detected person relative to one or more other objects (e.g., a background) of the scene.
The scene-based image editing system 106 may utilize various machine learning architectures to reconstruct three-dimensional human poses. In accordance with one or more embodiments, the scene-based image editing system 106 utilizes a neural network as described below: Kevin Lin, Lijuan Wang, and Zicheng Liu, "End-to-End Human Pose and Mesh Reconstruction with Transformers," CVPR (2021) (hereinafter "Lin"), which is incorporated herein by reference in its entirety. In particular, the scene-based image editing system 106 utilizes the neural network to reconstruct three-dimensional human poses and mesh vertices (e.g., shapes) from a single monocular image. For example, the three-dimensional pose/shape neural network 6008 includes a transformer-based encoder for jointly modeling vertex-vertex and vertex-joint interactions to jointly generate three-dimensional joint coordinates and mesh vertices. In an alternative embodiment, the scene-based image editing system 106 utilizes separate neural networks to generate the three-dimensional pose data 6010 and the three-dimensional shape data 6012, respectively, from the two-dimensional image 6000.
In one or more embodiments, as shown in fig. 60, scene-based image editing system 106 utilizes two-dimensional and three-dimensional data to generate a three-dimensional representation of two-dimensional human 6002. Specifically, the scene-based image editing system 106 generates a three-dimensional human model 6014 by combining the two-dimensional pose data 6006 with the three-dimensional pose data 6010 and the three-dimensional shape data 6012. For example, scene-based image editing system 106 refines three-dimensional pose data 6010 using two-dimensional pose data 6006. Further, the scene-based image editing system 106 utilizes the three-dimensional shape data 6012 in combination with the refined three-dimensional pose data to generate a three-dimensional human model 6014 having the pose of the two-dimensional human 6002 in the two-dimensional image 6000 while maintaining the shape of the two-dimensional human 6002.
As described above, the scene-based image editing system 106 generates a three-dimensional representation of a two-dimensional human in a two-dimensional image by extracting pose and shape data from the two-dimensional image. For example, as shown in fig. 61A-61D, the scene-based image editing system 106 generates and combines two-dimensional data and three-dimensional data representing a two-dimensional human being in a two-dimensional image. In particular, the scene-based image editing system 106 utilizes multiple separate neural networks to extract two-dimensional and three-dimensional characteristics of humans in a two-dimensional image. The scene-based image editing system 106 also refines the three-dimensional data for a three-dimensional representation of a human using one or more optimization/refinement models.
For example, as shown in Fig. 61A, the scene-based image editing system 106 generates two-dimensional pose data based on a two-dimensional human in the two-dimensional image 6100. In one or more embodiments, the scene-based image editing system 106 generates an image mask 6102 based on the two-dimensional image 6100. In particular, the scene-based image editing system 106 determines a portion of the two-dimensional image 6100 that includes a human and generates the image mask 6102 based on the identified portion of the two-dimensional image 6100. In additional embodiments, the scene-based image editing system 106 crops the two-dimensional image 6100 in response to detecting a human in the two-dimensional image 6100. For example, the scene-based image editing system 106 utilizes a cropping neural network that automatically detects the human in the two-dimensional image 6100 and crops the two-dimensional image 6100 to the portion that includes the human.
In one or more embodiments, the scene-based image editing system 106 utilizes the image mask 6102 to extract pose information from the two-dimensional image 6100. For example, the scene-based image editing system 106 provides the two-dimensional image 6100 with the image mask 6102 to a two-dimensional pose neural network 6104. Alternatively, the scene-based image editing system 106 provides a cropped image based on the two-dimensional image 6100 and the image mask 6102 to the two-dimensional pose neural network 6104. In additional examples, the scene-based image editing system 106 provides the two-dimensional image 6100 (e.g., without cropping and without image masking) to the two-dimensional pose neural network 6104. For example, the scene-based image editing system 106 provides the two-dimensional image 6100 to a two-dimensional pose neural network 6104 that generates a cropped image corresponding to the two-dimensional human in the two-dimensional image 6100.
As described above, in one or more embodiments, the two-dimensional pose neural network 6104 includes a two-dimensional body tracker that detects and identifies humans in two-dimensional images. In particular, the scene-based image editing system 106 utilizes the two-dimensional pose neural network 6104 to detect a person in the two-dimensional image 6100 (e.g., within a portion corresponding to the image mask 6102). Further, the scene-based image editing system 106 utilizes the two-dimensional pose neural network 6104 to generate two-dimensional pose data corresponding to the human pose in the two-dimensional image 6100.
For example, as shown in Fig. 61A, the scene-based image editing system 106 utilizes the two-dimensional pose neural network 6104 to generate a two-dimensional skeleton 6106 for the person in the two-dimensional image 6100. In particular, the scene-based image editing system 106 generates the two-dimensional skeleton 6106 by determining bones 6108 (connected via various joints) representing the body structure of the human in the two-dimensional image 6100. To illustrate, the scene-based image editing system 106 determines the length, position, and rotation of the bones 6108 corresponding to particular body parts (limbs, torso, etc.) relative to the two-dimensional space of the two-dimensional image 6100. Thus, the scene-based image editing system 106 generates the two-dimensional skeleton 6106 in pixel coordinates from the human in the two-dimensional image 6100.
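One possible (hypothetical) data structure for such a two-dimensional skeleton stores each bone's joint endpoints in pixel coordinates and derives length and rotation from them:

```python
import math
from dataclasses import dataclass

@dataclass
class Bone2D:
    """One bone of a two-dimensional skeleton, expressed in pixel coordinates.

    Illustrative structure: each bone stores its two joint endpoints in image
    space; length and rotation are derived from the endpoints.
    """
    name: str                    # e.g. "left_forearm"
    start: tuple[float, float]   # (x, y) joint position in pixels
    end: tuple[float, float]     # (x, y) joint position in pixels

    @property
    def length(self) -> float:
        dx, dy = self.end[0] - self.start[0], self.end[1] - self.start[1]
        return math.hypot(dx, dy)

    @property
    def rotation(self) -> float:
        # Angle of the bone relative to the image x-axis, in degrees.
        return math.degrees(math.atan2(self.end[1] - self.start[1],
                                       self.end[0] - self.start[0]))

# A skeleton is then simply a collection of such bones, e.g.
# skeleton_2d = [Bone2D("left_forearm", (312.0, 405.5), (298.2, 470.1)), ...]
```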
In additional embodiments, the scene-based image editing system 106 utilizes the two-dimensional gestural neural network 6104 to generate a bounding box 6110 corresponding to a portion of the two-dimensional image 6100. In particular, scene-based image editing system 106 generates bounding box 6110 to indicate one or more body parts of a human in two-dimensional image 6100. For example, the scene-based image editing system 106 marks body parts corresponding to one or more bones 6108 and/or one or more groups of bones in the two-dimensional skeleton 6106. To illustrate, the scene-based image editing system 106 generates bounding boxes related to the whole body of a human. In some embodiments, the scene-based image editing system 106 also generates separate bounding boxes corresponding to the hands (e.g., a first bounding box for a first hand and a second bounding box for a second hand).
In one or more embodiments, scene-based image editing system 106 generates annotation 6112 based on a human in two-dimensional image 6100. In particular, the scene-based image editing system 106 utilizes a two-dimensional pose neural network 6104 to determine one or more categories of humans based on visual features of the humans in combination with other pose data (e.g., a two-dimensional skeleton 6106). For example, the scene-based image editing system 106 generates annotations indicating whether the human whole body is visible in the two-dimensional image 6100, whether the detected pose is a standing pose (neutral or non-neutral) or a non-standing pose and/or orientation of the pose (e.g., anterior, lateral, posterior). In some embodiments, scene-based image editing system 106 generates additional annotations indicating other features, such as whether a human holds an object, what type of clothing the human wears, and other details that may affect the shape or pose of the human.
In connection with generating two-dimensional pose data for a human in two-dimensional image 6100, scene-based image editing system 106 also generates three-dimensional pose and shape data for a human. In at least some embodiments, the scene-based image editing system 106 utilizes one or more neural networks to extract three-dimensional characteristics of a human from the two-dimensional image 6100. For example, fig. 61B illustrates a scene-based image editing system 106 that utilizes multiple neural networks to extract three-dimensional pose/shape data for a particular portion of a human. To illustrate, the scene-based image editing system 106 generates separate three-dimensional pose/shape data for the whole-body portion and one or more hand portions of the two-dimensional image 6100.
As shown in Fig. 61B, the scene-based image editing system 106 determines data corresponding to the two-dimensional image 6100 based on the two-dimensional pose data extracted from the two-dimensional image 6100. For example, the scene-based image editing system 106 determines an image mask 6102a associated with the two-dimensional image 6100. Further, in one or more embodiments, the scene-based image editing system 106 determines a bounding box 6110a and an annotation 6112a generated by the two-dimensional pose neural network 6104 based on the image mask 6102a and the two-dimensional image 6100. Thus, the scene-based image editing system 106 provides data extracted from the two-dimensional image 6100 to a plurality of neural networks. In an alternative embodiment, the scene-based image editing system 106 provides the two-dimensional image 6100 to the neural networks.
In one or more embodiments, scene-based image editing system 106 provides data extracted from two-dimensional image 6100 to one or more neural networks to generate three-dimensional pose data and three-dimensional shape data. In particular, the scene-based image editing system 106 utilizes a first neural network (e.g., three-dimensional pose/shape neural network 6114 a) to generate three-dimensional pose data and three-dimensional shape data for a first portion of a human in a two-dimensional image 6100. For example, the scene-based image editing system 106 provides a body bounding box corresponding to the body of a two-dimensional human in the two-dimensional image 6100 to the three-dimensional pose/shape neural network 6114 a. The scene-based image editing system 106 also provides one or more annotations associated with the body of the two-dimensional human to the three-dimensional pose/shape neural network 6114 a.
In additional embodiments, the scene-based image editing system 106 provides the data extracted from the two-dimensional image 6100 to a second neural network (e.g., the three-dimensional hand neural network 6114b) to generate three-dimensional pose data and three-dimensional shape data for a second portion of the human in the two-dimensional image 6100. For example, the scene-based image editing system 106 provides one or more hand bounding boxes corresponding to one or more hands of a two-dimensional human in the two-dimensional image 6100 to the three-dimensional hand neural network 6114b. In some embodiments, the scene-based image editing system 106 provides one or more annotations associated with the hand(s) of the two-dimensional human to the three-dimensional hand neural network 6114b.
In accordance with one or more embodiments, the scene-based image editing system 106 utilizes a neural network to generate three-dimensional pose data and three-dimensional shape data for different portions of a two-dimensional human in a two-dimensional image 6100. In particular, the scene-based image editing system 106 generates body and hand pose/shape data using separate neural networks. For example, the scene-based image editing system 106 utilizes the three-dimensional pose/shape neural network 6114a to generate three-dimensional pose data and three-dimensional shape data for a body portion of a human in the two-dimensional image 6100. Further, the scene-based image editing system 106 utilizes the three-dimensional hand neural network 6114b to generate three-dimensional pose data and three-dimensional shape data for the hand portion(s) of the two-dimensional image 6100.
For example, the scene-based image editing system 106 utilizes the three-dimensional pose/shape neural network 6114a to generate a three-dimensional skeleton 6116a corresponding to a person in the two-dimensional image 6100. To illustrate, the scene-based image editing system 106 generates bones 6118a corresponding to humans in a two-dimensional image 6100 within a three-dimensional space. More specifically, the scene-based image editing system 106 determines the length, rotation, orientation, and relative position of the bone 6118a in three-dimensional space. Further, the scene-based image editing system 106 determines joints connecting bones in three-dimensional space, including determining one or more possible angles of rotation corresponding to the bones 6118a and their respective joints.
In one or more embodiments, the scene-based image editing system 106 also utilizes the three-dimensional pose/shape neural network 6114a to generate a three-dimensional shape 6120a for the body portion of the human in the two-dimensional image 6100. Specifically, the scene-based image editing system 106 generates the three-dimensional shape 6120a by generating a grid of multiple vertices within the three-dimensional space based on the shape of the human detected in the two-dimensional image 6100. For example, scene-based image editing system 106 generates vertices connected by edges from the detected shape of the human, each vertex having corresponding three-dimensional coordinates.
In conjunction with generating the three-dimensional skeleton 6116a and the three-dimensional shape 6120a, the scene-based image editing system 106 generates a three-dimensional body model 6122 corresponding to the body portion of the human in the two-dimensional image 6100. In particular, the scene-based image editing system 106 combines the three-dimensional skeleton 6116a with the three-dimensional shape 6120a to generate the three-dimensional body model 6122. For example, the scene-based image editing system 106 generates the three-dimensional body model 6122 with a default pose (e.g., a rigged mesh in a t-pose) based on the three-dimensional shape 6120a. The scene-based image editing system 106 then modifies the three-dimensional body model 6122 according to the three-dimensional skeleton 6116a, for example by adjusting the pose of the three-dimensional shape 6120a to fit portions of the three-dimensional body model 6122 to the bones 6118a.
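One standard way to pose a rigged t-pose mesh from a skeleton is linear blend skinning; the sketch below illustrates that general idea and is not necessarily the rigging scheme used by the scene-based image editing system:

```python
import numpy as np

def linear_blend_skinning(rest_vertices, skin_weights, bone_transforms):
    """Pose a rest (t-pose) mesh with per-bone rigid transforms.

    rest_vertices:   (V, 3) vertex positions of the rigged t-pose mesh
    skin_weights:    (V, B) per-vertex weights over B bones (rows sum to 1)
    bone_transforms: (B, 4, 4) homogeneous transforms mapping each bone from
                     its rest pose to the target pose
    """
    # Homogeneous vertex coordinates: (V, 4).
    v_hom = np.concatenate([rest_vertices, np.ones((len(rest_vertices), 1))], axis=1)

    # Transform every vertex by every bone, then blend with the skin weights.
    per_bone = np.einsum('bij,vj->bvi', bone_transforms, v_hom)[..., :3]  # (B, V, 3)
    return np.einsum('vb,bvi->vi', skin_weights, per_bone)                # (V, 3)
```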
In accordance with one or more embodiments, the scene-based image editing system 106 utilizes the three-dimensional hand neural network 6114b to generate a three-dimensional hand skeleton 6116b corresponding to a portion of a human in the two-dimensional image 6100. In particular, the scene-based image editing system 106 generates bones 6118b corresponding to the human hand in the two-dimensional image 6100 in three-dimensional space. For example, the scene-based image editing system 106 determines the length, rotation, orientation, and relative position of the bone 6118b in three-dimensional space. To illustrate, the scene-based image editing system 106 determines joints connecting hand bones in three-dimensional space, including determining one or more possible angles of rotation corresponding to the bones 6118b and their respective joints.
In accordance with one or more embodiments, the scene-based image editing system 106 also utilizes the three-dimensional hand neural network 6114b to generate a three-dimensional hand shape 6120b for the hand portion of the human in the two-dimensional image 6100. Specifically, the scene-based image editing system 106 generates a three-dimensional hand shape 6120b by generating a grid of multiple vertices within the three-dimensional space based on the shape of the hand detected in the two-dimensional image 6100. For example, scene-based image editing system 106 generates vertices connected by edges from the detected hand shape, each vertex having corresponding three-dimensional coordinates.
Further, the scene-based image editing system 106 generates a three-dimensional hand model 6124 corresponding to the hand portion of the human in the two-dimensional image 6100. In particular, the scene-based image editing system 106 combines the three-dimensional hand skeleton 6116b with the three-dimensional hand shape 6120b to generate the three-dimensional hand model 6124. In one or more embodiments, the scene-based image editing system 106 generates the three-dimensional hand model 6124 having a default pose (e.g., having a particular orientation of the hand in three-dimensional space and/or a particular splay of the fingers) based on the three-dimensional hand shape 6120b. Furthermore, the scene-based image editing system 106 modifies the three-dimensional hand model 6124 according to the three-dimensional hand skeleton 6116b, for example by adjusting the pose of the three-dimensional hand shape 6120b to fit portions of the three-dimensional hand model 6124 to the bones 6118b.
In one or more embodiments, the scene-based image editing system 106 generates a plurality of three-dimensional hand models corresponding to each human hand in the two-dimensional image 6100. In particular, the scene-based image editing system 106 utilizes the three-dimensional hand neural network 6114b to generate a separate three-dimensional hand model for each hand in the two-dimensional image 6100. For example, the scene-based image editing system 106 generates a plurality of three-dimensional hand models using a plurality of hand bounding boxes extracted from the two-dimensional image 6100.
In accordance with one or more embodiments, the scene-based image editing system 106 uses the same neural network structure for the three-dimensional pose/shape neural network 6114a and the three-dimensional hand neural network 6114 b. For example, as previously described, scene-based image editing system 106 utilizes different instances of the neural network described in Lin to generate one or more three-dimensional representations of one or more portions of a human in two-dimensional image 6100. The scene-based image editing system 106 generates separate instances for extracting body-specific or hand-specific three-dimensional pose/shape data from a two-dimensional image. To illustrate, the scene-based image editing system 106 trains the three-dimensional pose/shape neural network 6114a to extract three-dimensional pose/shape data from the body of the two-dimensional image based on a training data set including the body and a corresponding three-dimensional body model. Further, the scene-based image editing system 106 trains the three-dimensional hand neural network 6114b based on a training data set including the hand and the corresponding three-dimensional hand model to extract three-dimensional pose/shape data from the hand of the two-dimensional image. In alternative embodiments, the scene-based image editing system 106 uses different architectures for the three-dimensional pose/shape neural network 6114a and the three-dimensional hand neural network 6114 b.
In response to generating the two-dimensional pose data and the three-dimensional pose data, the scene-based image editing system 106 performs one or more optimization operations to generate a final three-dimensional representation of the human in the two-dimensional image. For example, FIG. 61C illustrates the scene-based image editing system 106 performing a first optimization operation related to generating a three-dimensional human model. More specifically, the scene-based image editing system 106 utilizes a first optimization operation to combine three-dimensional data in the two-dimensional image corresponding to a human being with three-dimensional data in the two-dimensional image corresponding to one or more hands of the human being.
As shown in fig. 61C, scene-based image editing system 106 utilizes merge model 6126 to merge three-dimensional body model 6122 and three-dimensional hand model 6124 (e.g., as generated in fig. 61B). For example, the scene-based image editing system 106 utilizes the merge model 6126 to generate a three-dimensional human model 6128 by connecting a three-dimensional hand model 6124 with a three-dimensional body model 6122 in three-dimensional space from camera space. Specifically, given a cropped hand region corresponding to the three-dimensional hand model 6124, the scene-based image editing system 106 utilizes the merge model 6126 to generate a predicted three-dimensional hand joint position in camera space and a predicted two-dimensional joint position in image space (e.g., two-dimensional space corresponding to a two-dimensional image).
Furthermore, the scene-based image editing system 106 utilizes the merge model to anchor the predicted three-dimensional hand joint positions of the three-dimensional hand model 6124 to the wrist of the whole-body three-dimensional joints. Specifically, the scene-based image editing system 106 utilizes the merge model to subtract the three-dimensional position of the wrist from the hand prediction (e.g., in the three-dimensional hand model 6124) and to add the three-dimensional position of the wrist from the whole-body prediction (e.g., from the three-dimensional body model 6122). Additionally, in one or more embodiments, the scene-based image editing system 106 maps the image coordinates of the cropped image (e.g., based on the image mask 6102a of Fig. 61B) to the full-image coordinates. Using the merge model 6126, the scene-based image editing system 106 replaces the whole-body-predicted two-dimensional hand joint positions with the hand-predicted two-dimensional joint positions according to the mapped coordinates. The scene-based image editing system 106 also uses the updated two-dimensional joints to connect the three-dimensional body model 6122 and the three-dimensional hand model 6124 and optimize the three-dimensional joints.
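The wrist-anchoring and crop-to-full-image coordinate mapping described above can be sketched roughly as follows; the joint indices, crop offset, and crop scale are assumed inputs, and the merge model's actual implementation may differ:

```python
import numpy as np

def merge_hand_into_body(body_joints_3d, hand_joints_3d, body_wrist_idx, hand_wrist_idx,
                         hand_joints_2d_crop, crop_offset, crop_scale):
    """Attach hand predictions to the whole-body prediction (illustrative sketch).

    The hand joints are re-anchored so that the hand model's wrist coincides
    with the whole-body wrist, and the hand's 2D joints predicted in crop
    coordinates are mapped back into full-image coordinates.
    """
    # Subtract the wrist position predicted by the hand model, then add the
    # wrist position predicted by the whole-body model.
    hand_anchored = (hand_joints_3d - hand_joints_3d[hand_wrist_idx]
                     + body_joints_3d[body_wrist_idx])

    # Map the cropped-image 2D joints back to full-image coordinates.
    hand_joints_2d_full = hand_joints_2d_crop * crop_scale + np.asarray(crop_offset)

    return hand_anchored, hand_joints_2d_full
```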
In accordance with one or more embodiments, the scene-based image editing system 106 also performs a second optimization operation to generate a final representation of the human being in the two-dimensional image. For example, FIG. 61D illustrates the scene-based image editing system 106 performing an optimization operation to refine three-dimensional pose data generated for a person in a two-dimensional image from two-dimensional pose data generated for the person. In particular, the scene-based image editing system 106 utilizes information about camera views associated with the two-dimensional image to modify the three-dimensional pose data based on the two-dimensional pose data.
As shown in Fig. 61D, the scene-based image editing system 106 refines the three-dimensional skeleton 6116a (e.g., including the bones 6118a) based on the two-dimensional skeleton 6106 (e.g., including the bones 6108), as described with respect to Fig. 61A. In one or more embodiments, the scene-based image editing system 106 utilizes the bone position refinement model 6130 to refine the positions, orientations, and joints corresponding to the bones 6118a in the three-dimensional skeleton 6116a based on the positions, orientations, and joints corresponding to the bones 6108 in the two-dimensional skeleton 6106.
In one or more embodiments, the scene-based image editing system 106 utilizes the bone position refinement model 6130 to modify the position and orientation of the bones 6118a in the three-dimensional skeleton 6116a to reduce the discrepancy relative to the bones 6108 in the two-dimensional skeleton 6106. For example, the scene-based image editing system 106 provides the bones 6118a of the three-dimensional skeleton 6116a to the bone position refinement model 6130 with the bones 6108 of the two-dimensional skeleton 6106 as a guiding reference. The scene-based image editing system 106 iteratively adjusts the three-dimensional skeleton 6116a using the bone position refinement model 6130 to reduce the difference between the three-dimensional skeleton 6116a and the two-dimensional skeleton 6106. In one or more embodiments, the scene-based image editing system 106 jointly modifies the positions and orientations of the bones 6118a of the three-dimensional skeleton 6116a to maintain the structure/shape of the three-dimensional skeleton 6116a according to the shape of the three-dimensional human model.
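One way to implement such an iterative refinement is gradient-based optimization of the three-dimensional joints against the two-dimensional skeleton, assuming a differentiable projection function from three-dimensional joints to image space; the regularization weight and optimizer settings below are illustrative:

```python
import torch

def refine_3d_joints(joints_3d_init, joints_2d_target, project_fn, steps=200, lr=1e-2):
    """Iteratively nudge 3D joints so their projection matches the 2D skeleton.

    `project_fn` is an assumed differentiable camera projection from 3D joint
    positions to image-space (x, y) coordinates.
    """
    joints_3d = joints_3d_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([joints_3d], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        reproj = project_fn(joints_3d)                       # (J, 2)
        loss = torch.mean((reproj - joints_2d_target) ** 2)  # reprojection error
        # A regularizer toward the initial pose helps preserve the bone structure.
        loss = loss + 1e-2 * torch.mean((joints_3d - joints_3d_init) ** 2)
        loss.backward()
        optimizer.step()
    return joints_3d.detach()
```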
In one or more embodiments, the scene-based image editing system 106 generates a three-dimensional representation of a two-dimensional human in a two-dimensional image for performing various downstream operations. For example, the scene-based image editing system 106 generates a three-dimensional representation for re-pose a two-dimensional human in a two-dimensional image. FIG. 62 shows an illustration of the scene-based image editing system 106 modifying a two-dimensional image 6200 comprising a two-dimensional human. More specifically, the scene-based image editing system 106 modifies the two-dimensional image 6200 by modifying the pose of the two-dimensional human via the three-dimensional representation of the two-dimensional human.
In accordance with one or more embodiments, the scene-based image editing system 106 generates a three-dimensional human model 6202 representing a two-dimensional human in a two-dimensional image 6200. Specifically, as described above, the scene-based image editing system 106 utilizes one or more neural networks to generate the three-dimensional human model 6202. For example, the scene-based image editing system 106 extracts pose and shape data associated with two-dimensional humans in two-dimensional space and three-dimensional space to generate a three-dimensional human model 6202 in three-dimensional space.
In at least some embodiments, the scene-based image editing system 106 provides the three-dimensional human model 6202 for display at a client device for modifying the pose of the three-dimensional human model 6202. For example, the scene-based image editing system 106 determines a modified pose of the three-dimensional human model 6202 based on the re-pose input 6206. To illustrate, the re-pose input 6206 includes an input to directly modify a pose of the three-dimensional human model 6202 via one or more graphical user interface elements. The scene-based image editing system 106 generates a modified three-dimensional human model 6204 from the modified pose.
In some embodiments, in conjunction with modifying the two-dimensional image 6200 based on the modified three-dimensional human model 6204, the scene-based image editing system 106 also extracts a texture map 6208 corresponding to the three-dimensional human model 6202. In particular, scene-based image editing system 106 extracts texture map 6208 from pixel values of two-dimensional image 6200 associated with three-dimensional human model 6202. For example, the scene-based image editing system 106 utilizes a neural network to generate texture map 6208, including UV mapping from image space to three-dimensional human model 6202. Thus, texture map 6208 includes pixel values that map to particular points (e.g., vertices or faces) of three-dimensional human model 6202 based on pixel values and corresponding locations in two-dimensional image 6200.
Further, in one or more embodiments, the scene-based image editing system 106 determines an intermediate representation 6210 of the modified three-dimensional human model 6204. In particular, the scene-based image editing system 106 generates a dense representation of the modified three-dimensional human model 6204 by assigning a particular value to each point in the modified three-dimensional human model 6204 (e.g., a unique value for each point on the body in a two-dimensional array). In some embodiments, the values in the dense representation include color values such that each point of the modified three-dimensional human model 6204 has a different assigned color value. Accordingly, different poses produce different dense representations.
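A minimal sketch of assigning a unique value to every mesh point, so that rendering the posed mesh yields a dense, pose-dependent map in which each pixel identifies the body point it came from (the index-to-color encoding here is one possible choice, not necessarily the system's):

```python
import numpy as np

def dense_vertex_colors(num_vertices):
    """Assign a unique RGB value to every vertex of the posed mesh.

    The vertex index is packed into the three color channels, so rendering the
    posed mesh with these colors produces a dense 2D map whose pixel values
    identify the corresponding body points.
    """
    idx = np.arange(num_vertices, dtype=np.uint32)
    colors = np.stack([(idx >> 16) & 0xFF, (idx >> 8) & 0xFF, idx & 0xFF], axis=1)
    return colors.astype(np.uint8)  # (V, 3), one unique color per vertex
```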
In one or more embodiments, the scene-based image editing system 106 utilizes the generator neural network 6212 to generate a modified two-dimensional image 6214 from the modified three-dimensional human model 6204. For example, the scene-based image editing system 106 provides the texture map 6208 and the intermediate representation 6210 to the generator neural network 6212 to generate a modified two-dimensional image 6214. In some embodiments, the scene-based image editing system 106 also provides the two-dimensional image 6200 (or additional intermediate representations of the pose of the two-dimensional human in the two-dimensional image 6200) to the generator neural network 6212.
The scene-based image editing system 106 utilizes the generator neural network 6212 to generate a modified two-dimensional image 6214 to include a two-dimensional human being re-posed according to the target pose indicated by the intermediate representation 6210 and the texture map 6208. To illustrate, the generator neural network 6212 predicts the pose and position of a two-dimensional human in the modified two-dimensional image 6214. Further, in one or more embodiments, the generator neural network 6212 generates one or more textures of one or more portions of the two-dimensional human and/or background in the modified two-dimensional image 6214 based on the context information provided by the two-dimensional image 6200 and/or texture map 6208.
In one or more embodiments, the scene-based image editing system 106 utilizes a generator neural network as described below: U.S. patent application Ser. No. 18/190,636, entitled "SYNTHESIZING A MODIFIED DIGITAL IMAGE UTILIZING A REPOSING MODEL" (Synthesis of modified digital images Using a re-pose model), filed on 3/2023, 27, which is incorporated herein by reference in its entirety. In particular, the scene-based image editing system 106 utilizes the generator neural network 6212 to generate a modified two-dimensional image 6214 via features extracted from the two-dimensional image 6200. For example, the scene-based image editing system 106 utilizes the generator neural network 6212 to modify the pose of a two-dimensional human from local features associated with the two-dimensional human while maintaining global features identified in the two-dimensional image 6200. Thus, the scene-based image editing system 106 provides modified two-dimensional human visual features according to target poses within the scene of the two-dimensional image 6200.
In one or more embodiments, the scene-based image editing system 106 provides a tool within a graphical user interface for modifying a human pose in a two-dimensional image via a three-dimensional representation. Furthermore, the scene-based image editing system 106 provides tools within the graphical user interface for generating a modified two-dimensional image based on the modified pose of the human. Figs. 63A-63G illustrate a graphical user interface of a client device for modifying a two-dimensional image by modifying the pose of a three-dimensional representation of a two-dimensional human in the two-dimensional image.
Fig. 63A shows a graphical user interface of a client application at a client device. For example, the client application includes a digital image editing application for performing various image editing tasks. In one or more embodiments, the client device displays a two-dimensional image 6300 that includes a scene involving a two-dimensional human 6302. Specifically, as shown, the two-dimensional image 6300 includes a two-dimensional human 6302 in the context of various objects.
In one or more embodiments, the scene-based image editing system 106 utilizes one or more neural networks to detect and extract the two-dimensional human 6302 from the two-dimensional image 6300. For example, scene-based image editing system 106 extracts two-dimensional human 6302 by generating an image mask for pixels comprising two-dimensional human 6302. To illustrate, the scene-based image editing system 106 utilizes a neural network trained to detect humans in digital images. Further, in some embodiments, scene-based image editing system 106 utilizes image masking to generate a cropped image that includes two-dimensional human 6302 (e.g., for storage in memory while performing one or more operations on two-dimensional image 6300).
In accordance with one or more embodiments, the scene-based image editing system 106 utilizes the cropped image to generate a three-dimensional representation of the two-dimensional human 6302. In particular, fig. 63B illustrates the scene-based image editing system 106 generating a three-dimensional human model 6304 representing a two-dimensional human 6302 in a two-dimensional image 6300. For example, the scene-based image editing system 106 generates the three-dimensional human model 6304 as an overlay display on the two-dimensional image 6300 within the graphical user interface. To illustrate, the scene-based image editing system 106 generates a three-dimensional human model 6304 based on the detected pose of the two-dimensional human 6302 (e.g., using a plurality of neural networks as described above), and displays the three-dimensional human model 6304 on top of the two-dimensional human 6302 within a graphical user interface.
In some embodiments, the client device displays the three-dimensional human model 6304 at a location within the graphical user interface corresponding to the two-dimensional human 6302 of the two-dimensional image 6300. In particular, the scene-based image editing system 106 places the three-dimensional human model 6304 at a location based on a mapping of features from the two-dimensional human 6302 to the three-dimensional human model 6304 in the two-dimensional image 6300. For example, the scene-based image editing system 106 places the three-dimensional human model 6304 based on detected features of the two-dimensional human 6302 (e.g., from texture maps) that correspond to portions of the three-dimensional human model 6304. In some embodiments, scene-based image editing system 106 determines coordinates for placement of three-dimensional human model 6304 from the image space of two-dimensional image 6300.
In at least some embodiments, the scene-based image editing system 106 provides the three-dimensional human model 6304 for display within the graphical user interface without a texture. To illustrate, the client device displays the three-dimensional human model 6304 with a default texture (e.g., a solid color such as gray). Alternatively, the client device displays the three-dimensional human model 6304 with a texture based on the two-dimensional human 6302 of fig. 63A. For example, the scene-based image editing system 106 generates an estimated texture in response to modifications to the three-dimensional human model 6304 and displays the estimated texture on the three-dimensional human model 6304.
In accordance with one or more embodiments, scene-based image editing system 106 provides a tool for modifying the pose of two-dimensional human 6302 via the pose of three-dimensional human model 6304. For example, scene-based image editing system 106 provides one or more tools for modifying the pose of three-dimensional human model 6304 in response to selection of the pose modification tool. Alternatively, scene-based image editing system 106 provides one or more tools for modifying the pose of three-dimensional human model 6304 in response to a contextual determination of intent associated with two-dimensional image 6300. To illustrate, scene-based image editing system 106 provides a tool for modifying the pose of three-dimensional human model 6304 in response to detecting two-dimensional human 6302 in two-dimensional image 6300. In some embodiments, the scene-based image editing system 106 provides a tool in response to selecting the two-dimensional human 6302 in the two-dimensional image 6300 via the graphical user interface.
FIG. 63C illustrates a client device displaying one or more graphical elements indicating modifiable portions of a three-dimensional human model. Specifically, as shown, the client device displays a three-dimensional human model 6306, the three-dimensional human model 6306 including a plurality of points indicative of modifiable joints in the three-dimensional human model 6306. For example, the client device displays selectable elements for each interaction point (e.g., each joint) in the three-dimensional human model 6306. In one or more embodiments, scene-based image editing system 106 determines points to display with three-dimensional human model 6306 based on joint or other pose information in the three-dimensional skeleton corresponding to three-dimensional human model 6306.
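A hedged sketch of how such selectable joint handles could be placed: the joints of the three-dimensional skeleton are projected into the image space of the graphical user interface with a pinhole camera model. The camera convention and function name are assumptions for illustration only.

```python
import numpy as np

def joint_handles(joint_positions, intrinsics):
    """Project 3D skeleton joints to 2D screen points for selectable handles.

    joint_positions: (J, 3) joint coordinates in camera space
                     (x right, y down, z forward into the scene).
    intrinsics:      3x3 pinhole camera matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    Returns a (J, 2) array of pixel coordinates at which to draw the handles.
    """
    pts = joint_positions @ intrinsics.T   # homogeneous image coordinates
    return pts[:, :2] / pts[:, 2:3]        # perspective divide
```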
In one or more additional embodiments, the scene-based image editing system 106 also provides a tool for viewing the projection of a two-dimensional image in three-dimensional space. For example, fig. 63D illustrates a three-dimensional representation 6308 of the two-dimensional image 6300 of fig. 63A generated by the scene-based image editing system 106 in three-dimensional space. In particular, the scene-based image editing system 106 generates a three-dimensional grid (e.g., via one or more neural networks) that represents the content of the two-dimensional image 6300. To illustrate, the scene-based image editing system 106 generates a three-dimensional representation 6308 that includes depth displacement information based on one or more foreground objects and/or one or more background objects in the two-dimensional image 6300. Thus, the scene-based image editing system 106 generates a three-dimensional representation 6308 that includes portions corresponding to the two-dimensional human 6302 shown in fig. 63A.
In accordance with one or more embodiments, the scene-based image editing system 106 generates a three-dimensional human model 6306a associated with a two-dimensional human (e.g., as shown in fig. 63D). Furthermore, the scene-based image editing system 106 locates the three-dimensional human model 6306a within the three-dimensional space based on the location of the portion of the three-dimensional representation 6308 that corresponds to the two-dimensional human. To illustrate, the scene-based image editing system 106 inserts the three-dimensional human model 6306a at the same location as the portion of the three-dimensional representation 6308 corresponding to the two-dimensional human, or in front of the portion of the three-dimensional representation 6308 corresponding to the two-dimensional human (e.g., relative to the camera location). In some embodiments, as shown, the client device modifies the displayed two-dimensional image to show the three-dimensional representation 6308, such as in response to user input rotating the three-dimensional representation 6308 within the three-dimensional space. In an alternative embodiment, scene-based image editing system 106 hides the three-dimensional representation 6308 within the graphical user interface while modifying the two-dimensional image 6300 using the three-dimensional space.
In one or more embodiments, the client device displays one or more additional interactive elements with the two-dimensional image 6300 in response to interaction with a point via the graphical user interface of the client device. For example, fig. 63E shows a second view of the client device displaying a three-dimensional representation 6308a of the two-dimensional image 6300 of fig. 63A. In response to selection of a point displayed on the three-dimensional human model 6306b representing the two-dimensional human 6302 of the two-dimensional image 6300, the scene-based image editing system 106 displays the interactive element 6310. In particular, the client device displays an interactive element 6310 that includes one or more axes for changing the rotation of the selected joint within the three-dimensional space in response to one or more inputs. To illustrate, the scene-based image editing system 106 utilizes inverse kinematics to change the position and/or rotation of one or more portions of the three-dimensional human model 6306b in response to one or more interactions with the interactive element 6310 at the selected point.
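The description above refers to inverse kinematics for dragging a selected joint. As an illustrative, simplified stand-in (not the system's actual solver), the following sketch applies cyclic coordinate descent (CCD) to a planar joint chain so the end effector follows a dragged handle.

```python
import numpy as np

def ccd_ik(joints, target, iterations=20, tol=1e-3):
    """Cyclic coordinate descent IK over a simplified 2D joint chain.

    joints: (N, 2) joint positions; joints[0] is the chain root and
            joints[-1] is the end effector being dragged by the user.
    target: (2,) desired end-effector position (e.g., the dragged handle).
    """
    joints = joints.astype(float).copy()
    for _ in range(iterations):
        # Sweep from the joint nearest the end effector back toward the root.
        for i in range(len(joints) - 2, -1, -1):
            to_end = joints[-1] - joints[i]
            to_target = target - joints[i]
            # Rotation that aligns the current end-effector direction with the target direction.
            angle = np.arctan2(to_target[1], to_target[0]) - np.arctan2(to_end[1], to_end[0])
            c, s = np.cos(angle), np.sin(angle)
            rot = np.array([[c, -s], [s, c]])
            # Rotate every joint downstream of joint i about joint i (bone lengths are preserved).
            joints[i + 1:] = (joints[i + 1:] - joints[i]) @ rot.T + joints[i]
        if np.linalg.norm(joints[-1] - target) < tol:
            break
    return joints
```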
In accordance with one or more embodiments, scene-based image editing system 106 also utilizes one or more constraints to determine a modified pose of the three-dimensional human model 6306b. In particular, the scene-based image editing system 106 determines one or more motion constraints for a portion (e.g., a selected joint) of the three-dimensional human model 6306b based on one or more pose priors corresponding to the portion. For example, the scene-based image editing system 106 determines one or more rotation angles of the hip of the three-dimensional human model 6306b based on typical hip rotation angles. Thus, the scene-based image editing system 106 limits rotation of one or more legs of the three-dimensional human model 6306b based on the motion constraints associated with the hip joint.
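A minimal sketch of how such motion constraints could be enforced, assuming per-joint rotation limits stand in for the pose priors. The joint names and numeric ranges are hypothetical values chosen only to illustrate clamping.

```python
import numpy as np

# Hypothetical per-joint rotation limits (degrees) standing in for learned pose priors.
JOINT_LIMITS = {
    "left_hip":  {"x": (-120.0, 20.0), "y": (-45.0, 45.0), "z": (-30.0, 70.0)},
    "left_knee": {"x": (0.0, 150.0),   "y": (0.0, 0.0),    "z": (0.0, 0.0)},
}

def constrain_rotation(joint_name, euler_xyz_deg):
    """Clamp a requested joint rotation to the motion range allowed for that joint."""
    limits = JOINT_LIMITS[joint_name]
    return np.array([
        np.clip(euler_xyz_deg[0], *limits["x"]),
        np.clip(euler_xyz_deg[1], *limits["y"]),
        np.clip(euler_xyz_deg[2], *limits["z"]),
    ])
```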
In one or more embodiments, scene-based image editing system 106 provides additional tools for re-posing the three-dimensional human model 6306b based on a pre-built pose library. For example, the client device displays a list of poses from which the user may select. In response to the selection of a pose, scene-based image editing system 106 modifies the pose of the three-dimensional human model 6306b based on the selected pose. To illustrate, the scene-based image editing system 106 obtains bone position and joint information (e.g., rotation/angle) from the selected pose and modifies the three-dimensional skeleton of the three-dimensional human model 6306b based on the obtained bone position and joint information.
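Applying a preset from a pose library can be sketched as copying per-joint rotations onto the skeleton, as below. The dictionary-based skeleton representation is an assumption for illustration, not the rig format of any described embodiment.

```python
def apply_preset_pose(skeleton, preset):
    """Copy bone rotations from a preset pose onto a three-dimensional skeleton.

    skeleton: dict mapping joint names to mutable joint records with a
              'rotation' entry (Euler angles or quaternion, whichever the rig uses).
    preset:   dict mapping joint names to target rotations, e.g. one entry
              of a pre-built pose library.
    """
    for joint_name, rotation in preset.items():
        if joint_name in skeleton:          # ignore joints the rig does not have
            skeleton[joint_name]["rotation"] = rotation
    return skeleton
```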
Fig. 63F illustrates a client device displaying a three-dimensional representation 6308b, which includes a modified three-dimensional human model 6312 having a modified pose. In particular, the scene-based image editing system 106 modifies one or more portions of the three-dimensional human model in response to one or more interactions with one or more interactive elements (e.g., interactive element 6310a corresponding to a particular portion of the modified three-dimensional human model 6312). Thus, as illustrated, the scene-based image editing system 106 modifies the three-dimensional human model representing the two-dimensional human in the two-dimensional image in accordance with one or more rotations and/or positional changes of one or more portions of the three-dimensional human model. Scene-based image editing system 106 provides the modified three-dimensional human model 6312 for display at the client device based on the pose modification input.
In one or more embodiments, scene-based image editing system 106 also provides a depiction of changes in the pose of the three-dimensional human model in conjunction with one or more pose modification inputs. For example, scene-based image editing system 106 modifies the pose of the three-dimensional human model displayed at the client device with the pose modification input. To illustrate, the scene-based image editing system 106 determines a range of motion of one or more portions of the three-dimensional human model from an initial pose of the three-dimensional human model (e.g., as shown in fig. 63E) and a target pose of the three-dimensional human model (e.g., as shown in fig. 63F). The scene-based image editing system 106 displays a range of motion of one or more portions of the three-dimensional human model within the graphical user interface (e.g., by following a cursor or touch input that moves or rotates a portion of the three-dimensional human model).
In some embodiments, scene-based image editing system 106 also updates the two-dimensional human in conjunction with the update to the three-dimensional human model. For example, the scene-based image editing system 106 determines respective ranges of motion for one or more portions of the two-dimensional human corresponding to one or more portions of the three-dimensional human model. To illustrate, the scene-based image editing system 106 determines that the pose modification input modifies a portion of the three-dimensional human model. The scene-based image editing system 106 determines the corresponding portion of the two-dimensional human and updates the pose of the two-dimensional human in real time based on modifications to the three-dimensional human model. In alternative embodiments, scene-based image editing system 106 updates the two-dimensional human in the two-dimensional image at predetermined time intervals, in response to a commit action, or based on the modified three-dimensional human model 6312.
Fig. 63G illustrates a client device displaying a modified two-dimensional image 6314 based on the modified three-dimensional human model 6312 of fig. 63F. Specifically, the scene-based image editing system 106 utilizes a neural network to generate the modified two-dimensional image 6314 based on the two-dimensional image 6300 (e.g., as shown in fig. 63A) and the modified three-dimensional human model 6312. For example, scene-based image editing system 106 generates a modified two-dimensional human 6316 that includes the modified pose based on the modified pose of the modified three-dimensional human model 6312. Further, in one or more embodiments, scene-based image editing system 106 generates one or more updated textures of the two-dimensional human from the modified pose (e.g., based on the initial texture map of the two-dimensional human). To illustrate, the scene-based image editing system 106 generates updated pixel values for previously invisible portions of the two-dimensional human (e.g., previously hidden portions of an arm or leg) or modifies the texture of clothing based on the modified pose. In response to determining that one or more portions of the background are revealed by the modified pose, the scene-based image editing system 106 also generates one or more repaired portions corresponding to the background behind the two-dimensional human.
In one or more embodiments, as described above, the scene-based image editing system 106 generates a modified two-dimensional image by re-posing a two-dimensional human in the two-dimensional image. FIG. 64 illustrates digital images associated with modifying the pose of a two-dimensional human in a two-dimensional image. In particular, fig. 64 shows a first two-dimensional image 6400 including a two-dimensional human 6402 having an initial pose. FIG. 64 also shows a second two-dimensional image 6404 generated in response to modifying the pose of a corresponding three-dimensional human model representing the two-dimensional human 6402. In particular, the second two-dimensional image 6404 includes a modified two-dimensional human 6406 based on the modified pose. Further, fig. 64 shows an intermediate representation 6408 generated by scene-based image editing system 106 based on the modified pose of the three-dimensional human model representing the two-dimensional human 6402.
In additional embodiments, the scene-based image editing system 106 provides tools for performing additional operations on two-dimensional humans in two-dimensional images via three-dimensional representations. In accordance with one or more embodiments, as previously described, the scene-based image editing system 106 provides a tool for modifying clothing of a two-dimensional human being from a three-dimensional representation generated by the scene-based image editing system 106 for the two-dimensional human being. Fig. 65 shows a two-dimensional image 6500 comprising a two-dimensional human being and a modified two-dimensional image 6502. In particular, the scene-based image editing system 106 generates a modified two-dimensional image 6502 to alter the pattern of clothing on the two-dimensional human in response to interaction with the three-dimensional human model representing the two-dimensional human in the two-dimensional image 6500 (e.g., by modifying a texture map of the three-dimensional human model representing the two-dimensional human in the two-dimensional image 6500).
Although FIG. 65 shows the scene-based image editing system 106 modifying a two-dimensional human in a two-dimensional image by changing the texture of the two-dimensional human's clothing, the scene-based image editing system 106 alternatively modifies the two-dimensional human by determining interactions between the two-dimensional human and one or more objects in the two-dimensional image. For example, the scene-based image editing system 106 determines interactions between three-dimensional objects in a three-dimensional space corresponding to the two-dimensional image and a three-dimensional human model representing the two-dimensional human. To illustrate, the scene-based image editing system 106 provides tools for interacting with objects in a scene, including a three-dimensional human model representing a two-dimensional human, and determining how these interactions affect other objects in the scene.
As an example, the scene-based image editing system 106 determines interactions between the re-posed three-dimensional human model and one or more clothing objects. In particular, modifying the pose of the three-dimensional human model may affect the position, shape, or other properties of the garment (e.g., a hat or shirt), which allows the scene-based image editing system 106 to provide a tool for fitting new garments, etc. In additional examples, the scene-based image editing system 106 determines interactions between the re-posed three-dimensional human model and one or more background objects. To illustrate, modifying the pose of the three-dimensional human model may cause a portion of the three-dimensional human model to contact a background object. In one or more embodiments, scene-based image editing system 106 determines whether such interactions occur based on a three-dimensional representation of a scene in a two-dimensional image, and imposes constraints on the pose of the three-dimensional human model (e.g., by preventing the human arms or legs from intersecting furniture) based on the interactions.
In additional embodiments, the scene-based image editing system 106 utilizes the generated three-dimensional human model representing a two-dimensional human in a two-dimensional image to perform one or more additional operations based on illumination in the two-dimensional image. For example, the scene-based image editing system 106 utilizes a three-dimensional representation of a scene of a two-dimensional image to render shadows in connection with modifying the content of the two-dimensional image. To illustrate, as previously described, in response to a re-pose and/or moving a three-dimensional human model within a three-dimensional space, the scene-based image editing system 106 utilizes the three-dimensional human model to generate realistic shadows of the two-dimensional image.
In accordance with one or more embodiments, the scene-based image editing system 106 also provides a tool for understanding the three-dimensional positioning of objects within a three-dimensional space. In particular, in one or more embodiments, the scene-based image editing system 106 utilizes a three-dimensional understanding of a scene of a two-dimensional image to determine the relative position of objects within the scene. In addition, the scene-based image editing system 106 utilizes the three-dimensional understanding of the scene to generate and display a planar surface related to the selection of objects in the image. Accordingly, the scene-based image editing system 106 provides an improved graphical user interface for modifying objects within a three-dimensional space to better visually understand the positioning of objects within a scene.
FIG. 66 illustrates an overview of a planar surface generated and displayed by the scene-based image editing system 106 for modifying objects in a three-dimensional representation of a scene. In particular, FIG. 66 illustrates the scene-based image editing system 106 generating a planar surface corresponding to the location of an object within a three-dimensional scene in response to selection of the object. The scene-based image editing system 106 maps the planar surface to the object such that modifications to the object within the three-dimensional space (e.g., modifications to the position of the object) result in modifying the planar surface as displayed within the graphical user interface.
In one or more embodiments, the scene-based image editing system 106 generates or otherwise determines a three-dimensional scene 6600 that includes a plurality of objects. For example, the scene-based image editing system 106 optionally utilizes one or more neural networks to extract content from the two-dimensional image 6602 and generate a three-dimensional scene 6600 that includes objects in the two-dimensional image 6602. To illustrate, the scene-based image editing system 106 generates one or more three-dimensional grids representing one or more objects (e.g., foreground and/or background objects) in the two-dimensional image 6602 in three-dimensional space. In particular, the scene-based image editing system 106 generates one or more foreground three-dimensional grids representing one or more foreground objects in the two-dimensional image and a background three-dimensional grid representing a background in the two-dimensional image. In an alternative embodiment, scene-based image editing system 106 determines three-dimensional scene 6600, including a three-dimensional grid representing any set of objects within a three-dimensional space (e.g., a three-dimensional grid generated via a three-dimensional model application).
According to one or more embodiments, scene-based image editing system 106 determines selected object 6604 from three-dimensional scene 6600. In particular, scene-based image editing system 106 determines to select an object from a plurality of objects within three-dimensional scene 6600 in response to an input indicating a selected object 6604. For example, scene-based image editing system 106 determines selected object 6604 in conjunction with a request to modify selected object 6604 within three-dimensional scene 6600. In some embodiments, scene-based image editing system 106 determines a plurality of selected objects in three-dimensional scene 6600 (e.g., in conjunction with a request to modify the plurality of selected objects together).
In one or more additional embodiments, scene-based image editing system 106 generates planar surface 6606 based on selected object 6604 (or a set of selected objects that includes selected object 6604). In particular, the scene-based image editing system 106 determines the three-dimensional position of the selected object 6604 (or a portion of the selected object 6604) within a three-dimensional space relative to one or more axes within the three-dimensional space. Further, the scene-based image editing system 106 generates a planar surface 6606 from the three-dimensional position of the selected object 6604. The scene-based image editing system 106 also provides a planar surface 6606 for display within a graphical user interface of the client device in connection with modifying selected objects 6604 in the three-dimensional scene 6600.
In one or more embodiments, scene-based image editing system 106 utilizes planar surface 6606 to provide a visual indication of movement of selected object 6604 within the three-dimensional space. In particular, modifying objects within a three-dimensional space through a two-dimensional interface can be restrictive and confusing for a user. For example, because the three-dimensional space is represented within a two-dimensional editing space, modifying objects in a scene according to a three-dimensional understanding of the scene given a fixed camera position (e.g., a fixed editing viewpoint) can be challenging. To illustrate, accurately representing the relative three-dimensional position of an object within a two-dimensional editing space may be difficult depending on the size of the object and/or the position of the selected object relative to the camera view or horizon. The scene-based image editing system 106 facilitates understanding the three-dimensional characteristics of a scene by providing a planar guide, thereby providing improved object interaction within the two-dimensional editing space.
In accordance with one or more embodiments, the scene-based image editing system 106 utilizes planar surfaces displayed within a two-dimensional editing space to indicate the three-dimensional position of a particular object within a three-dimensional space. To illustrate, the scene-based image editing system 106 displays and moves planar surfaces within a three-dimensional space in conjunction with movement of corresponding objects. Thus, the scene-based image editing system 106 provides additional visual content related to transforming objects in a three-dimensional scene within a two-dimensional editing space. Some conventional image editing systems provide directions or visual axes to assist a user in editing digital content in an editing space, which may be difficult to interpret from a single point of view. In contrast to these conventional systems, the scene-based image editing system 106 utilizes three-dimensional understanding of objects in a scene in conjunction with planar surfaces to provide intuitive and continuous transformation/movement of objects via a two-dimensional editing space of a fixed camera. To illustrate, by providing a transformation plane related to modifying objects in three-dimensional space, scene-based image editing system 106 provides a visual indication of forward-backward or up-down movement in a three-dimensional scene based on movement in image space.
FIG. 67 shows a diagram including additional details of the scene-based image editing system 106 generating and displaying planar surfaces for transforming objects in a three-dimensional scene. Specifically, fig. 67 shows a three-dimensional scene 6700 that includes a plurality of objects within a three-dimensional space. For example, as previously described, the scene-based image editing system 106 determines a three-dimensional scene 6700 that includes a three-dimensional representation of a two-dimensional scene in a two-dimensional image (e.g., by generating a three-dimensional mesh for objects in the two-dimensional scene). Thus, the scene-based image editing system 106 determines three-dimensional characteristics of the scene (e.g., three-dimensional shapes and/or coordinates of objects in the scene within the three-dimensional space).
In one or more embodiments, as shown in FIG. 67, scene-based image editing system 106 determines selected object 6702 (or a selected group of objects) from three-dimensional scene 6700. In particular, in response to user input selecting an object in the three-dimensional scene 6700, the scene-based image editing system 106 receives an indication of the selected object 6702. Alternatively, scene-based image editing system 106 determines selected object 6702 in response to an automated process of selecting objects in three-dimensional scene 6700 (e.g., in response to utilizing one or more neural networks to infer intent related to editing a digital image). In some embodiments, scene-based image editing system 106 highlights selected object 6702 within a graphical user interface displaying three-dimensional scene 6700.
According to one or more embodiments, the scene-based image editing system 106 determines a particular portion of the selected object 6702 (or the selected group of objects) for which to generate a graphical element that indicates the position of the selected object 6702 in three-dimensional space. For example, fig. 67 shows an indication of an object portion 6704 of the selected object 6702. To illustrate, the scene-based image editing system 106 determines the portion of the selected object 6702 based on the position of the selected object 6702 along one or more axes in the three-dimensional space. In some embodiments, the scene-based image editing system 106 identifies a vertex or set of vertices corresponding to a minimum value along a particular axis, such as the lowest portion of the selected object 6702 along the z-axis (vertical axis), or the bottom of the selected object 6702 relative to the camera view of the three-dimensional scene 6700.
As shown in fig. 67, the scene-based image editing system 106 also determines a three-dimensional position value 6706 corresponding to the object portion 6704 within the three-dimensional space. In particular, the scene-based image editing system 106 identifies three-dimensional coordinates within a three-dimensional space corresponding to the location of the object portion 6704. To illustrate, in response to identifying a lowest portion (e.g., bottom) of the selected object 6702 along a particular axis or set of axes, the scene-based image editing system identifies three-dimensional coordinates of the lowest portion. In at least some embodiments, the three-dimensional position value 6706 includes a minimum value of the object portion 6704 along the vertical axis. In an alternative embodiment, the three-dimensional position values 6706 correspond to different portions of the selected object 6702, such as the top, center, or centroid of the selected object 6702.
In accordance with one or more embodiments, in response to determining the three-dimensional position value 6706, the scene-based image editing system 106 generates a planar surface 6708 corresponding to the three-dimensional position value 6706. In particular, scene-based image editing system 106 generates planar surface 6708 along two or more axes within three-dimensional space. For example, the scene-based image editing system 106 generates an infinite plane or partial plane intersecting the three-dimensional position values 6706 along two or more axes. To illustrate, the scene-based image editing system 106 generates the planar surface 6708 at three-dimensional position values along the x-axis and the y-axis such that the planar surface 6708 has the same z-axis value (e.g., a constant height along the planar surface 6708). In an alternative embodiment, scene-based image editing system 106 generates planar surface 6708 along more than two axes, such as along a flat, angled surface within three-dimensional scene 6700 (e.g., along a surface of object portion 6704).
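A minimal sketch, assuming the z-axis is the vertical axis as in the description above, of locating the lowest vertex of a selected mesh and building a horizontal planar surface at that height. The finite extent stands in for an "infinite" plane; the names and values are illustrative.

```python
import numpy as np

def ground_plane_for_object(vertices, half_extent=10.0):
    """Build a horizontal planar surface at the lowest point of a selected mesh.

    vertices:    (V, 3) mesh vertices of the selected object, z treated as vertical.
    half_extent: how far the plane reaches from the object's center along x and y.
    Returns the plane's corner vertices and two triangles indexing into them.
    """
    lowest = vertices[np.argmin(vertices[:, 2])]   # object portion: lowest vertex
    cx, cy, z = vertices[:, 0].mean(), vertices[:, 1].mean(), lowest[2]
    corners = np.array([
        [cx - half_extent, cy - half_extent, z],
        [cx + half_extent, cy - half_extent, z],
        [cx + half_extent, cy + half_extent, z],
        [cx - half_extent, cy + half_extent, z],
    ])
    faces = np.array([[0, 1, 2], [0, 2, 3]])
    return corners, faces
```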
In at least some embodiments, the scene-based image editing system 106 generates planar surfaces 6708 at particular locations within the three-dimensional space based on, but not intersecting, the three-dimensional position values 6706. For example, the scene-based image editing system 106 determines additional three-dimensional position values at a distance from the three-dimensional position values 6706 along one or more axes in three-dimensional space. To illustrate, the scene-based image editing system 106 determines additional three-dimensional position values by applying predetermined displacement values to the three-dimensional position values 6706 along a vertical axis. In some embodiments, scene-based image editing system 106 generates planar surface 6708 on the surface of the additional object at a particular distance from three-dimensional position value 6706.
In addition to generating planar surface 6708, scene-based image editing system 106 also generates texture 6710 of planar surface 6708. In particular, scene-based image editing system 106 generates texture 6710 to indicate that planar surface 6708 is a visual element that helps modify objects within the three-dimensional space. In one or more embodiments, the scene-based image editing system 106 generates a partially transparent texture for displaying the planar surface 6708 while also providing a visual indication of objects/content behind or below the planar surface 6708 from the perspective of the camera view in the graphical user interface. Furthermore, in some embodiments, scene-based image editing system 106 generates a plurality of textures for planar surface 6708 (e.g., for different portions of planar surface 6708 or for different states of planar surface 6708).
In accordance with one or more embodiments, scene-based image editing system 106 displays planar surface 6708 with three-dimensional scene 6700 within a graphical user interface. In particular, FIG. 67 illustrates that scene-based image editing system 106 provides a displayed planar surface 6712 within a graphical user interface. For example, the scene-based image editing system 106 provides textures 6710 to the displayed planar surface 6712 at three-dimensional position values 6706 within a three-dimensional space that includes the three-dimensional scene 6700 or at additional three-dimensional position values that are based on the three-dimensional position values 6706. Thus, the scene-based image editing system 106 provides an infinite plane or partial plane within the graphical user interface for providing additional information associated with the relative positioning of one or more objects within the three-dimensional space.
In one or more embodiments, the scene-based image editing system 106 also provides a tool for modifying the selected object 6702 within the three-dimensional space. For example, the scene-based image editing system 106 provides a tool for modifying the position of the selected object 6702 within three-dimensional space. As shown in FIG. 67, in response to modifying the position of selected object 6702, scene-based image editing system 106 provides modified object 6714 within the graphical user interface. In particular, scene-based image editing system 106 provides modified object 6714 at a new location and modifies the location or visual characteristics of displayed planar surface 6712 in response to modifying the location of selected object 6702. To illustrate, in response to modifying the position of the selected object 6702 along one or more axes, the scene-based image editing system 106 causes the displayed planar surface 6712 to follow the corresponding object portion 6704 along one or more axes.
As described above, the scene-based image editing system 106 provides a tool for modifying objects within a graphical user interface through the use of planar surfaces. In particular, scene-based image editing system 106 utilizes planar surfaces in conjunction with one or more additional elements within a graphical user interface to provide an accurate and intuitive three-dimensional representation of object localization within a three-dimensional scene. For example, in one or more embodiments, the scene-based image editing system 106 generates a three-dimensional representation of a two-dimensional scene in a two-dimensional image. In some embodiments, the scene-based image editing system 106 utilizes planar surfaces to show the relative three-dimensional positioning of objects within the scene while maintaining a fixed camera viewpoint (e.g., two-dimensional visualization of planar surfaces in a two-dimensional editing interface).
Figs. 68A-68E illustrate a graphical user interface of a client device for generating and displaying planar surfaces related to transforming one or more objects in a three-dimensional scene 6800. In one or more embodiments, the client device includes a mobile device (e.g., a smartphone or tablet), a desktop device, or a laptop device. Further, in one or more embodiments, as shown in FIG. 68A, the scene-based image editing system 106 generates the three-dimensional scene 6800 based on a two-dimensional scene in a two-dimensional image. For example, the scene-based image editing system 106 generates the three-dimensional scene 6800 based on two-dimensional objects in a two-dimensional image to include one or more three-dimensional meshes having vertices and edges in three-dimensional space. In an alternative embodiment, the three-dimensional scene 6800 includes one or more objects generated via a three-dimensional editing application.
According to one or more embodiments, the client device detects, via a graphical user interface, interactions with objects in the three-dimensional scene 6800. For example, the client device provides tools for moving, adding, removing, or otherwise interacting with objects in the three-dimensional scene 6800. In some embodiments, the client device detects input interacting with objects in the three-dimensional scene 6800 via a touch screen interface and/or via a mouse device. Additionally, in one or more embodiments directed to generating the three-dimensional scene 6800 from a two-dimensional image, the client device provides a two-dimensional editing space for modifying objects within the two-dimensional image according to the detected three-dimensional characteristics of the two-dimensional image. To illustrate, the scene-based image editing system 106 provides three-dimensional editing of objects in a two-dimensional scene.
In one or more embodiments, the client device detects a selection of an object in the three-dimensional scene 6800a. For example, as shown in fig. 68B, the client device receives input via the graphical user interface indicating a selected object 6802a within the three-dimensional scene 6800a. Specifically, as shown, the client device determines an intent to transform or modify the selected object 6802a within the three-dimensional scene 6800a. To illustrate, the client device determines an indicated or inferred intent to modify the position of the selected object 6802a within the three-dimensional scene 6800a.
In conjunction with determining an intent to modify the selected object 6802a, the scene-based image editing system 106 generates a planar surface 6804a. In particular, the scene-based image editing system 106 generates the planar surface 6804a at a particular location within the three-dimensional scene 6800a corresponding to a portion of the selected object 6802a. For example, the scene-based image editing system 106 determines the bottom of the selected object 6802a (e.g., one or more vertices of the sofa leg at the lowest position along the z-axis in fig. 68B). The scene-based image editing system 106 generates the planar surface 6804a from the determined portion of the selected object 6802a for display on the client device.
In accordance with one or more embodiments, as previously described, scene-based image editing system 106 generates planar surface 6804a comprising a partially transparent texture. For example, the client displays the planar surface 6804a in a three-dimensional space having a partially transparent texture such that one or more objects (e.g., one or more background objects) behind or below the planar surface 6804a are at least partially visible. To illustrate, the client device displays a planar surface 6804a under the selected object 6802a at a particular three-dimensional position value along a vertical axis within the three-dimensional scene 6800 a.
In one or more embodiments, scene-based image editing system 106 also generates an object platform 6806a corresponding to the selected object 6802a with respect to the planar surface 6804a. Specifically, the scene-based image editing system 106 generates the object platform 6806a by determining the position of the selected object 6802a relative to the planar axes of the planar surface 6804a. For example, the scene-based image editing system 106 determines three-dimensional position values of one or more edges of the selected object 6802a along the planar axes. Further, the scene-based image editing system 106 generates the object platform 6806a based on the three-dimensional position values of the one or more edges of the selected object 6802a, for example, by generating a bounding box indicating the horizontal positioning of the selected object 6802a on the planar surface 6804a.
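The object platform can be sketched as the footprint of the selected object's vertices along the plane's two axes, expanded by a small margin and placed at the plane's height. The margin value and function name are illustrative assumptions.

```python
import numpy as np

def object_platform(vertices, plane_z, margin=0.05):
    """Compute the object-platform rectangle on a horizontal planar surface.

    vertices: (V, 3) mesh vertices of the selected object (z treated as vertical).
    plane_z:  height of the planar surface the platform sits on.
    Returns the four platform corners bounding the object's footprint on the plane.
    """
    min_x, min_y = vertices[:, 0].min() - margin, vertices[:, 1].min() - margin
    max_x, max_y = vertices[:, 0].max() + margin, vertices[:, 1].max() + margin
    return np.array([
        [min_x, min_y, plane_z],
        [max_x, min_y, plane_z],
        [max_x, max_y, plane_z],
        [min_x, max_y, plane_z],
    ])
```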
In some embodiments, scene-based image editing system 106 generates separate textures for object platform 6806a. In particular, scene-based image editing system 106 generates a first texture for a portion of planar surface 6804a external to object platform 6806a and a second texture for a portion of planar surface 6804a within object platform 6806a. To illustrate, scene-based image editing system 106 applies a different transparency to object platform 6806a than the rest of planar surface 6804 a. Alternatively, scene-based image editing system 106 generates a contour or bounding box for object platform 6806a for display at the client device.
In one or more embodiments, scene-based image editing system 106 modifies planar surfaces in connection with transforming selected objects in a three-dimensional scene. Fig. 68C illustrates a graphical user interface for modifying a three-dimensional scene 6800b that includes a selected object 6802 b. In particular, in response to modifying the position of the selected object 6802b along a planar axis (e.g., in a horizontal direction) corresponding to the planar surface 6804b displayed at the client device, the scene-based image editing system 106 modifies the planar surface 6804b. For example, scene-based image editing system 106 changes the position of object platform 6806b based on modifications to selected object 6802b, such as by moving object platform 6806b to stay under selected object 6802b in response to moving selected object 6802b from a first position to a second position. By modifying the position of the object platform 6806b based on the position of the selected object 6802b, the scene-based image editing system provides a three-dimensional understanding of the relative position of the object in the three-dimensional scene 6800b within the two-dimensional editing space of the graphical user interface of the client device. In some embodiments, as the object platform 6806b is moved, the scene-based image editing system 106 also modifies the visual characteristics of the object platform 6806 b.
In one or more additional embodiments, the scene-based image editing system 106 provides a tool for modifying the position of an object within a three-dimensional scene perpendicular to a planar surface. In particular, fig. 68D illustrates a graphical user interface of a client device moving a planar surface along with an object in a three-dimensional scene 6800c. For example, fig. 68D illustrates movement of the selected object 6802c in a vertical direction relative to a planar surface 6804c generated to illustrate the transformation of the selected object 6802c. More specifically, the client device displays movement of the selected object 6802c in an upward direction relative to a floor surface within the three-dimensional scene 6800c.
In one or more embodiments, the client device moves the selected object 6802c and the planar surface 6804c in a direction perpendicular to the planar surface 6804c in response to an input selecting and moving the selected object 6802c from the first position to the second position. In an alternative embodiment, the client device moves the selected object 6802c and planar surface 6804c in a direction perpendicular to the planar surface 6804c from the first position to the second position in response to an input selecting and modifying the position of the planar surface 6804c. Thus, in one or more embodiments, the client device moves the selected object 6802c or the planar surface 6804c together in response to movement of the selected object 6802c or the planar surface 6804c. As shown, in some embodiments, in response to moving the selected object 6802c, the client device displays at least a portion of the planar surface 6804c having a partially or fully transparent texture to display objects behind the planar surface 6804c.
Furthermore, in conjunction with vertically moving the planar surface 6804c, the scene-based image editing system 106 also modifies the perspective of the planar surface 6804c with respect to a camera view (e.g., horizon) associated with the two-dimensional editing space to maintain an accurate perspective of the planar surface 6804c with other content in the three-dimensional scene 6800 c. For example, in connection with generating or otherwise determining the three-dimensional scene 6800c, the scene-based image editing system 106 also determines camera parameters (e.g., position, field of view, pitch, or roll). The scene-based image editing system 106 utilizes the camera parameters to determine a horizon corresponding to the three-dimensional scene 6800 c. The scene-based image editing system 106 utilizes camera parameters to determine how to move and display the planar surface 6804c relative to other content in the three-dimensional scene 6800c, e.g., based on the distance of the location of the planar surface 6804c from the horizon.
In accordance with one or more additional embodiments, the scene-based image editing system 106 provides additional information related to modifying the selected object 6802c within the graphical user interface. In particular, the scene-based image editing system 106 generates an additional plane 6808 (e.g., an additional planar surface) representing the distance between the selected object 6802c and an additional surface in the three-dimensional scene 6800c. For example, the client device of fig. 68D displays the additional plane 6808 sized based on the proximity of the selected object 6802c (or object platform 6806c) to the additional surface. More specifically, the scene-based image editing system 106 determines the size of the visible portion of the additional plane based on the distance between the selected object 6802c and the additional surface or object.
To illustrate, the client device displays the additional plane 6808 in response to the distance between the selected object 6802c and the additional surface being within a threshold distance. In some embodiments, the client device increases the size of the additional plane 6808 as the selected object 6802c becomes closer to the additional surface (e.g., based on vertical movement of the selected object 6802 c), and decreases the size of the additional plane 6808 as the selected object 6802c moves away from the additional surface. Further, in some embodiments, the client device conceals the additional plane 6808 in response to the distance between the selected object 6802c and the additional surface exceeding a threshold distance.
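A hedged sketch of the proximity behavior described above: the additional plane is hidden beyond a threshold distance and grows as the selected object approaches the additional surface. The threshold and scaling are illustrative, not prescribed values.

```python
def proximity_plane_size(object_to_surface_distance, threshold=1.0, max_size=2.0):
    """Size (and hide) the additional plane based on how close the moving object is.

    Returns 0.0 (hidden) when the object is farther than `threshold` from the
    additional surface, and grows toward `max_size` as the distance shrinks.
    """
    d = max(object_to_surface_distance, 0.0)
    if d > threshold:
        return 0.0                              # beyond threshold: hide the plane
    return max_size * (1.0 - d / threshold)     # closer object -> larger plane
```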
Further, in some embodiments, scene-based image editing system 106 generates one or more additional planes for multiple surfaces of a three-dimensional scene based on the proximity of an object moving within the three-dimensional scene. Fig. 68E illustrates a three-dimensional scene 6800d that includes a selected object 6802d moving within the three-dimensional scene 6800d. In particular, in response to movement of the selected object 6802d within the three-dimensional scene 6800d (and corresponding movement of the planar surface 6804d and/or the object platform 6806d), the scene-based image editing system 106 generates an additional plane 6810 on an additional surface (e.g., a wall) in the three-dimensional scene 6800d. For example, as the selected object 6802d moves horizontally along one or more axes, the scene-based image editing system 106 displays (or hides) the additional plane 6810 at a corresponding size based on the distance between the selected object 6802d and the additional surface. In other embodiments, scene-based image editing system 106 determines whether to display an additional plane based on the direction of movement of the selected object 6802d and/or the size of the additional surface.
In one or more embodiments, scene-based image editing system 106 displays planar surfaces having a limited size in connection with modifying objects in a three-dimensional scene. For example, while fig. 68A-68E illustrate the scene-based image editing system 106 generating an infinite (or large) planar surface corresponding to the position of the selected object in the three-dimensional scene, the scene-based image editing system 106 alternatively generates a planar surface based on the size of the selected object. For illustration, fig. 69A-69B illustrate a graphical user interface displaying a planar surface having a size and position based on the size and position of a selected object.
FIG. 69A illustrates a three-dimensional scene 6900a that includes a plurality of objects within a graphical user interface of a client device. For example, the client device displays the selected object 6902a within the three-dimensional scene 6900a. Further, the client device displays a planar surface 6904a that includes a location based on a portion of the selected object 6902a. Additionally, in one or more embodiments, scene-based image editing system 106 generates planar surface 6904a comprising a limited size based on the size of selected object 6902a. To illustrate, scene-based image editing system 106 generates planar surface 6904a from an object platform corresponding to the size of selected object 6902a. In one or more embodiments, the scene-based image editing system 106 generates an infinite planar surface while displaying only a portion of the planar surface, such as an object platform portion of the planar surface.
In some embodiments, scene-based image editing system 106 generates planar surfaces having different shapes based on the shape of the selected object or other criteria. As shown in fig. 69A, scene-based image editing system 106 generates planar surface 6904a based on the shape of selected object 6902 a. Alternatively, the scene-based image editing system 106 generates a planar surface according to default settings, the shape of the surface on which the selected object is located, the visibility of a particular shape, and the like. Fig. 69B illustrates a three-dimensional scene 6900B that includes a selected object 6902B and a planar surface 6904B generated for the selected object 6902B. As shown, planar surface 6904B of fig. 69B includes a different shape than planar surface 6904a of fig. 69A. In additional embodiments, scene-based image editing system 106 generates planar surfaces, including but not limited to circles, squares, rectangles, or other shapes.
In accordance with one or more embodiments, scene-based image editing system 106 provides different textures to a planar surface and/or object platform based on the state of the planar surface and/or the state of the object platform. Fig. 70A-70C illustrate a graphical user interface displaying a three-dimensional scene including a plurality of objects. For example, fig. 70A shows the client device displaying a three-dimensional scene 7000A including the selected object 7002a in a first location. In particular, the client device receives a request to move the selected object 7002a from a first location to a second location.
As shown, scene-based image editing system 106 generates planar surface 7004a from a first location of selected object 7002 a. In one or more embodiments, scene-based image editing system 106 generates planar surface 7004a at a location displaced from selected object 7002 a. For example, as shown in fig. 70A, the scene-based image editing system 106 generates a planar surface 7004a on top of a surface (e.g., a table) under the selected object 7002 a.
In one or more embodiments, the scene-based image editing system 106 provides an option (e.g., a selected tool or mode) to move the selected object 7002a perpendicular to the planar surface 7004a. Thus, in this mode, rather than moving the planar surface 7004a with the selected object 7002a, the scene-based image editing system 106 moves the selected object 7002a along at least one axis without moving the planar surface 7004a. To illustrate, the scene-based image editing system 106 limits movement of the selected object 7002a to a vertical direction relative to the planar surface 7004a. In one or more additional embodiments, scene-based image editing system 106 moves the planar surface 7004a as the selected object 7002a moves along one or more axes while restricting movement of the planar surface 7004a along one or more other axes. For example, the scene-based image editing system 106 moves the planar surface 7004a horizontally (or otherwise along the planar axes) without moving the planar surface 7004a in a vertical direction.
Fig. 70A shows the client device displaying the planar surface 7004a at a first location having a first texture. For example, in response to determining that the planar surface 7004a is stationary (e.g., not moving and in a resting state), the client device displays the planar surface 7004a having the first texture. To illustrate, the client device displays the planar surface 7004a with a flat texture indicating that the planar surface 7004a is not moving.
Fig. 70B shows the client device displaying a planar surface 7004B having a second texture in conjunction with moving the selected object 7002B to a second location. In particular, in response to determining that planar surface 7004b is moving (e.g., in response to moving selected object 7002b from a first location to a second location), the client device displays planar surface 7004b having a second texture. In one or more embodiments, the second texture includes a texture having contours, irregularities, or other fine-grained details that help illustrate movement of the planar surface 7004b within the three-dimensional scene 7000 b. In some embodiments, scene-based image editing system 106 also modifies the transparency (or other visual characteristics) of planar surface 7004b in response to moving planar surface 7004b.
In one or more additional embodiments, scene-based image editing system 106 moves planar surface 7004b along a surface within three-dimensional scene 7000 b. In particular, scene-based image editing system 106 determines that selected object 7002b is moved in a direction along one or more planar axes (e.g., parallel to planar surface 7004 b) beyond the surface of the additional object. For example, when planar surface 7004b moves out of the surface of the table, scene-based image editing system 106 automatically moves planar surface 7004b to an additional surface (e.g., a floor surface) in three-dimensional scene 7000 b. To illustrate, in embodiments in which planar surface 7004b and selected object 7002b are separated within three-dimensional scene 7000b, scene-based image editing system 106 moves planar surface 7004b along one or more additional surfaces.
Fig. 70C shows a three-dimensional scene 7000c including a plurality of objects, along with a set of options associated with transforming the objects in conjunction with a planar surface. In particular, as shown, the client device displays a set of tools for modifying the selected object 7002c within the three-dimensional scene 7000c and/or relative to the planar surface 7004c. For example, the tools include a translation option 7006 for snapping the position of the selected object 7002c to the surface of an additional object. To illustrate, in response to selection of the translation option 7006, the scene-based image editing system 106 determines the nearest surface in a particular direction (e.g., based on the position of the planar surface 7004c, a selected direction, or a vertical direction below the selected object 7002c).
The scene-based image editing system 106 snaps the selected object 7002c to the nearest surface by translating the selected object to the determined three-dimensional position value corresponding to the nearest surface. For example, scene-based image editing system 106 determines a new position of the selected object 7002c by translating the selected object 7002c along one or more axes such that the selected object 7002c is adjacent to or in contact with the nearest surface without overlapping the nearest surface. In some embodiments, in response to detecting a separation between the selected object 7002c and the planar surface 7004c, the scene-based image editing system 106 snaps the selected object 7002c to the planar surface 7004c.
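A minimal sketch of the snapping behavior, assuming the vertical axis is z and that candidate support surfaces beneath the object have already been found (e.g., by a downward ray cast, not shown). It returns the vertical translation that rests the object on the nearest surface below it without overlap.

```python
def snap_to_surface_below(object_min_z, surface_heights):
    """Return the vertical translation that snaps an object onto the nearest surface below it.

    object_min_z:    lowest z value of the selected object's mesh.
    surface_heights: z values of candidate support surfaces (floor, table tops, ...)
                     found beneath the object.
    """
    below = [h for h in surface_heights if h <= object_min_z]
    if not below:
        return 0.0                   # nothing underneath: leave the object in place
    nearest = max(below)             # closest surface below the object
    return nearest - object_min_z    # translate down so the object rests on it
```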
In one or more embodiments, the scene-based image editing system 106 provides one or more additional types of indicators of the three-dimensional position of objects within the three-dimensional space for display within the two-dimensional editing space. For example, fig. 71 shows a graphical user interface of a client device displaying a three-dimensional scene 7100 comprising a plurality of objects. As shown, in response to determining the selected object 7102 within the three-dimensional scene 7100 (e.g., within a two-dimensional image for which the scene-based image editing system 106 has generated the three-dimensional scene 7100), the scene-based image editing system 106 generates a three-dimensional bounding box 7104. Further, moving the selected object 7102 causes the scene-based image editing system 106 to move the three-dimensional bounding box 7104 with the selected object 7102.
In one or more embodiments, the scene-based image editing system 106 generates a three-dimensional bounding box 7104 comprising a plurality of three-dimensional coordinates corresponding to corners of the three-dimensional bounding box 7104. The client device displays the three-dimensional bounding box 7104 by converting the three-dimensional coordinates into two-dimensional coordinates in a two-dimensional editing space (e.g., image space) corresponding to the two-dimensional image. Thus, the three-dimensional bounding box 7104 includes a planar surface at the bottom of the selected object and an additional (transparent) plane according to three-dimensional coordinates.
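The conversion from the bounding box's three-dimensional corner coordinates to two-dimensional coordinates in the editing space can be sketched as a standard pinhole projection. The matrix conventions and function name are assumptions for illustration.

```python
import numpy as np

def project_bounding_box(corners_world, extrinsics, intrinsics):
    """Project the eight corners of a 3D bounding box into 2D image space.

    corners_world: (8, 3) corner coordinates in the three-dimensional scene.
    extrinsics:    4x4 world-to-camera matrix.
    intrinsics:    3x3 pinhole camera matrix.
    Returns (8, 2) pixel coordinates used to draw the box in the editing interface.
    """
    ones = np.ones((corners_world.shape[0], 1))
    cam = np.hstack([corners_world, ones]) @ extrinsics.T   # world -> camera space
    img = cam[:, :3] @ intrinsics.T                          # camera -> homogeneous pixels
    return img[:, :2] / img[:, 2:3]                          # perspective divide
```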
In additional embodiments, the scene-based image editing system 106 provides additional visual indications based on interactions between objects within the three-dimensional scene. Fig. 72 shows a graphical user interface of a client device displaying a three-dimensional scene 7200 comprising at least a first object 7202 and a second object 7204. In particular, in response to selection of the first object 7202 and input to move the first object 7202 in a direction toward the second object 7204, the scene-based image editing system 106 determines that the first object 7202 intersects the second object 7204. In response to the first object 7202 intersecting the second object 7204, the client device displays the first object 7202 with a modified texture or color to indicate an invalid position of the first object 7202. Thus, the scene-based image editing system 106 utilizes the color/texture of the first object 7202 in conjunction with the planar surface 7206 and/or the object platform 7208 to indicate the position of the first object 7202 within the three-dimensional scene 7200.
In one or more embodiments, scene-based image editing system 106 provides a tool for modifying the focus of a two-dimensional image according to depth values detected in a corresponding three-dimensional scene. Specifically, the scene-based image editing system 106 generates and utilizes a three-dimensional representation of the two-dimensional scene to estimate depth values of the content of the two-dimensional scene. Further, the scene-based image editing system 106 provides interface tools for indicating depth values used to modify or set the focus of a camera associated with the two-dimensional image in order to modify blur values for portions of the two-dimensional image. In some cases, the scene-based image editing system 106 also provides a tool for selecting portions of the two-dimensional image based on the estimated depth values, for example, in conjunction with focused and/or blurred portions of the two-dimensional image. In additional embodiments, the scene-based image editing system 106 utilizes the estimated depth values of the two-dimensional image corresponding to input elements to apply other local image modifications, such as color changes, illumination changes, or other transformations, to particular content of the two-dimensional image (e.g., to one or more objects or to one or more portions of one or more objects).
FIG. 73 illustrates an overview of modifying blur in one or more portions of a two-dimensional image based on a corresponding three-dimensional representation by the scene-based image editing system 106. In particular, FIG. 73 illustrates that the scene-based image editing system provides tools for interacting with a two-dimensional image via graphical user interface elements to indicate focus of the two-dimensional image. The scene-based image editing system 106 utilizes a three-dimensional representation of the two-dimensional image to determine a focus and applies image blur based on the focus.
In one or more embodiments, the scene-based image editing system 106 determines a two-dimensional image 7300 that includes a two-dimensional scene having one or more objects. For example, a two-dimensional scene includes one or more foreground objects and one or more background objects. To illustrate, the two-dimensional image 7300 of fig. 73 includes scenes of a plurality of buildings along a street from a particular perspective. In alternative embodiments, two-dimensional image 7300 includes different scenes of other types of objects, including portrait, panorama, composite image content, and so forth.
In an additional embodiment, scene-based image editing system 106 provides a tool for selecting a new focus in two-dimensional image 7300. In particular, scene-based image editing system 106 provides a tool for indicating the position of input element 7302 within two-dimensional image 7300. For example, the scene-based image editing system 106 determines the input elements 7302 within the two-dimensional image 7300 from the location of a cursor input or touch input within the two-dimensional image 7300. Alternatively, scene-based image editing system 106 determines input elements 7302 within two-dimensional image 7300 from the position of the three-dimensional object inserted into the three-dimensional representation of two-dimensional image 7300 based on user input.
In response to determining the position of input element 7302, scene-based image editing system 106 determines the focus of two-dimensional image 7300. As shown in fig. 73, scene-based image editing system 106 generates modified two-dimensional image 7304 by blurring one or more portions of two-dimensional image 7300. For example, scene-based image editing system 106 determines a focus based on the three-dimensional position of input element 7302 and blurs one or more portions according to their depth differences relative to the focus.
By utilizing the input elements to determine the focus of the two-dimensional image, the scene-based image editing system 106 provides customizable focus modifications of the two-dimensional image. In particular, the scene-based image editing system 106 provides an improved graphical user interface for interacting with a two-dimensional image to modify focus after capturing the two-dimensional image. Unlike conventional systems that provide an option for determining the focus of an image (e.g., focus via a camera lens) only when capturing the image, the scene-based image editing system 106 provides focus customization through three-dimensional understanding of a two-dimensional scene. Thus, the scene-based image editing system 106 provides a tool for editing image blur for any two-dimensional image through a three-dimensional representation of the two-dimensional image.
Furthermore, by modifying the focus of the two-dimensional image with a three-dimensional representation of the two-dimensional image, the scene-based image editing system 106 also provides greater accuracy than conventional systems. Unlike conventional systems that apply a blur filter in image space based on selection of portions of a two-dimensional image, the scene-based image editing system 106 utilizes a three-dimensional representation of the two-dimensional image to determine a three-dimensional position in three-dimensional space. Thus, the scene-based image editing system 106 utilizes the three-dimensional position to provide more accurate blurring of portions of the two-dimensional image based on the three-dimensional depth of the portions of the two-dimensional image relative to the focal point.
Fig. 74A-74C illustrate schematic diagrams of a scene-based image editing system 106 modifying a two-dimensional image via a graphical user interface tool that customizes focus for the two-dimensional image. In particular, FIG. 74A illustrates the scene-based image editing system 106 generating a partially blurred two-dimensional image in response to modifying a focus of the two-dimensional image based on input elements. FIG. 74B illustrates the scene-based image editing system 106 modifying a two-dimensional image according to a custom focus via three-dimensional rendering. FIG. 74C illustrates the scene-based image editing system 106 modifying a two-dimensional image according to a custom focus via two-dimensional rendering.
As described above, FIG. 74A illustrates the scene-based image editing system 106 modifying the two-dimensional image 7400 according to a custom focus. In one or more embodiments, scene-based image editing system 106 generates three-dimensional representation 7402 of two-dimensional image 7400. For example, scene-based image editing system 106 utilizes one or more neural networks to generate three-dimensional representation 7402. To illustrate, the scene-based image editing system 106 utilizes a neural network to estimate depth values of the two-dimensional image 7400. In some embodiments, scene-based image editing system 106 generates three-dimensional representation 7402 by generating one or more three-dimensional meshes corresponding to one or more objects in two-dimensional image 7400 from the estimated depth values.
In accordance with one or more embodiments, the scene-based image editing system 106 determines the input elements 7404 associated with the two-dimensional image 7400 and the three-dimensional representation 7402. In particular, the scene-based image editing system 106 determines the input elements 7404 based on user input via a graphical user interface (e.g., via mouse/touch input) and/or based on a three-dimensional representation of the user input. More specifically, scene-based image editing system 106 determines input elements relative to three-dimensional representation 7402.
To illustrate, the scene-based image editing system 106 determines a two-dimensional position or movement of the user input relative to the image space of the two-dimensional image 7400. In particular, the scene-based image editing system 106 detects an input (e.g., via a graphical user interface) to indicate a particular point in the image space of the two-dimensional image 7400. The scene-based image editing system 106 determines the input element 7404 in the image space at an indicated point in three-dimensional space relative to the three-dimensional representation 7402. Alternatively, scene-based image editing system 106 detects an input to move input element 7404 within three-dimensional representation 7402 in a direction corresponding to the input.
In one or more embodiments, scene-based image editing system 106 determines input element 7404 by generating a three-dimensional object in a three-dimensional space that includes three-dimensional representation 7402. In particular, the scene-based image editing system 106 generates three-dimensional objects within a three-dimensional space that are related to the focus of the two-dimensional image 7400. For example, in response to an initial input or request to set or modify the focus of two-dimensional image 7400, scene-based image editing system 106 generates three-dimensional objects (e.g., spheres, cubes, planes, points) in three-dimensional representation 7402. In additional embodiments, scene-based image editing system 106 modifies the position of the three-dimensional object in three-dimensional representation 7402 based on the position of input element 7404.
In some embodiments, scene-based image editing system 106 determines three-dimensional position 7406 based on input element 7404. In particular, scene-based image editing system 106 determines three-dimensional coordinates corresponding to the position of input element 7404 relative to three-dimensional representation 7402. For example, scene-based image editing system 106 determines three-dimensional position 7406 based on a center point of a three-dimensional object corresponding to input element 7404 within three-dimensional representation 7402. In other embodiments, the scene-based image editing system 106 determines the three-dimensional position 7406 based on a projection of two-dimensional coordinates of the input element 7404 (e.g., corresponding to a cursor or other input via a graphical user interface) into a three-dimensional space of the three-dimensional representation 7402.
As shown in fig. 74A, in response to determining the three-dimensional position 7406 corresponding to the input element 7404, the scene-based image editing system 106 determines a focus 7408 for the two-dimensional image 7400. Specifically, scene-based image editing system 106 determines focus 7408 within the three-dimensional space of three-dimensional representation 7402 based on three-dimensional position 7406. For example, scene-based image editing system 106 determines focus 7408 based on three-dimensional position 7406 and a camera position of a camera corresponding to two-dimensional image 7400. To illustrate, the scene-based image editing system 106 utilizes the distance between the three-dimensional position 7406 and the camera position to determine a focus 7408 for the camera.
In one or more embodiments, scene-based image editing system 106 generates modified two-dimensional image 7410 based on focal point 7408. In particular, scene-based image editing system 106 generates modified two-dimensional image 7410 by blurring one or more portions of two-dimensional image 7400 according to focus 7408. For example, scene-based image editing system 106 blurs portions of two-dimensional image 7400 based on a depth distance between focal point 7408 and the portions according to the camera position of two-dimensional image 7400. Further, in some embodiments, the scene-based image editing system 106 utilizes one or more blur preferences to determine blur intensities, blur distances, etc. to generate the modified two-dimensional image 7410.
As described above, in some embodiments, scene-based image editing system 106 determines movement of input element 7404 within the three-dimensional space of three-dimensional representation 7402. For example, scene-based image editing system 106 detects movement of input element 7404 relative to three-dimensional representation 7402 from a first position to a second position. In response to detecting the movement from the first position to the second position, the scene-based image editing system 106 updates focus 7408 accordingly. The scene-based image editing system 106 then generates an updated modified two-dimensional image based on the new focus.
In additional embodiments, in response to the range of movement of the input element 7404, the scene-based image editing system 106 continuously updates the graphical user interface to display the continuously modified two-dimensional image. Specifically, scene-based image editing system 106 determines the movement of focus 7408 based on the range of movement of input element 7404. The scene-based image editing system 106 also generates a plurality of different modified two-dimensional images with different blur based on the moving focus. In some embodiments, scene-based image editing system 106 generates an animation that obscures different portions of the two-dimensional image based on the range of movement of input element 7404 and focus 7408.
FIG. 74B illustrates the scene-based image editing system 106 utilizing a custom focus to modify blur of a two-dimensional image through three-dimensional rendering of a scene. In one or more embodiments, as described above, scene-based image editing system 106 determines focus 7408 of two-dimensional image 7400 of FIG. 74A based on input elements relative to the three-dimensional representation. For example, scene-based image editing system 106 determines focus 7408 for use in configuring a camera within a three-dimensional space that includes a three-dimensional representation.
In accordance with one or more embodiments, scene-based image editing system 106 utilizes focal point 7408 to determine camera parameters 7412 of the camera. Specifically, the scene-based image editing system 106 sets the focal length of the camera according to the indicated focus 7408. To illustrate, the scene-based image editing system 106 determines the focal distance based on the distance between the camera and the three-dimensional position of the focal point in three-dimensional space. In additional embodiments, scene-based image editing system 106 determines additional camera parameters, such as, but not limited to, field of view, camera angle, or lens radius, associated with focus 7408.
Further, in one or more embodiments, scene-based image editing system 106 utilizes three-dimensional renderer 7414 to generate modified two-dimensional image 7410a. In particular, the scene-based image editing system 106 renders a modified two-dimensional image 7410a from the three-dimensional representation of the scene of the two-dimensional image 7400 of FIG. 74A using a three-dimensional renderer 7414 having camera parameters 7412. For example, the scene-based image editing system 106 utilizes a three-dimensional renderer 7414 to render two-dimensional images from a three-dimensional representation using ray tracing or other three-dimensional rendering process.
By modifying the camera parameters 7412 for use by the three-dimensional renderer 7414 based on the focus 7408, the scene-based image editing system 106 generates a modified two-dimensional image 7410a to include the true focus blur. To illustrate, the three-dimensional renderer 7414 utilizes differences in depth values of portions of the three-dimensional representation in conjunction with camera parameters 7412 to determine blur of portions of the modified two-dimensional image 7410a. Thus, in response to the modification of focus 7408, scene-based image editing system 106 updates camera parameters 7412 and re-renders the two-dimensional image with updated focus blur. Utilizing the three-dimensional renderer 7414 allows the scene-based image editing system 106 to provide smooth/continuous blurring of portions of a scene of a two-dimensional image in conjunction with changes in focus relative to a three-dimensional representation of the two-dimensional image.
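One common way a renderer relates camera parameters and depth differences to focus blur is the thin-lens circle-of-confusion model. The sketch below assumes that model and illustrative parameter values; it is not the specific rendering procedure used by the three-dimensional renderer 7414 of the disclosure.

```python
import numpy as np

def circle_of_confusion(depth, focus_distance, focal_length=0.05, aperture=0.02):
    """Thin-lens circle-of-confusion diameter for scene points at `depth`.

    focus_distance: distance to the focal point derived from the input element;
    focal_length, aperture: illustrative camera parameters in scene units (meters).
    """
    depth = np.asarray(depth, dtype=float)
    return (aperture * np.abs(depth - focus_distance) / depth
            * focal_length / (focus_distance - focal_length))

# Points near the 4 m focal plane stay sharp; nearer and farther points blur more.
print(circle_of_confusion([1.0, 4.0, 10.0], focus_distance=4.0))
```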
In additional embodiments, the scene-based image editing system 106 utilizes a two-dimensional rendering process to generate a modified two-dimensional image with a customized focus. For example, FIG. 74C illustrates the scene-based image editing system 106 blurring a two-dimensional image via a customized focus and estimated depth values of content in the two-dimensional image. In one or more embodiments, the scene-based image editing system 106 utilizes a two-dimensional rendering process to modify blur values for portions of a two-dimensional image in accordance with a three-dimensional understanding of the two-dimensional image.
In accordance with one or more embodiments, scene-based image editing system 106 utilizes focal point 7408 of two-dimensional image 7400 to determine two-dimensional location 7416 in image space of two-dimensional image 7400. In particular, scene-based image editing system 106 utilizes the three-dimensional position of focal point 7408 within the three-dimensional representation of two-dimensional image 7400 to determine two-dimensional position 7416. For example, the scene-based image editing system 106 utilizes a mapping (e.g., a UV mapping or other projection mapping) between three-dimensional space and image space to determine a two-dimensional location.
As shown in fig. 74C, scene-based image editing system 106 also determines depth value 7418 for two-dimensional position 7416 corresponding to focus 7408. In one or more embodiments, the scene-based image editing system 106 determines the depth value 7418 from the depth map 7420 of the two-dimensional image 7400. In particular, scene-based image editing system 106 generates depth map 7420 in connection with generating a three-dimensional representation of two-dimensional image 7400. To illustrate, scene-based image editing system 106 generates a depth value for each pixel in two-dimensional image 7400 based on the location of one or more objects detected in two-dimensional image 7400 and stores the depth values (e.g., in a matrix or vector) for all pixels within depth map 7420. The scene-based image editing system 106 extracts the depth value 7418 from the depth map 7420 that corresponds to the pixel at the two-dimensional location 7416.
In accordance with one or more embodiments, scene-based image editing system 106 utilizes the depth value 7418 corresponding to focus 7408 to determine blur in two-dimensional image 7400. As shown in fig. 74C, scene-based image editing system 106 utilizes two-dimensional renderer 7422 to apply blur filter 7424 based on focus 7408. For example, scene-based image editing system 106 determines blur filter 7424 based on the depth values in depth map 7420 and the depth value 7418 of focus 7408. To illustrate, scene-based image editing system 106 determines blur values for a plurality of pixels in two-dimensional image 7400 based on differences in the depth values of the pixels relative to the depth value 7418 of focus 7408. For example, the scene-based image editing system 106 determines the blur filter 7424 according to various camera parameters indicative of blur intensity, blur distance, etc., and applies the blur filter 7424 to pixels in the two-dimensional image 7400 according to depth distance to generate the modified two-dimensional image 7410b.
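As a concrete illustration of this two-dimensional rendering path, the sketch below blends between pre-blurred copies of the image according to each pixel's depth distance from the focal depth. It is a minimal approximation assuming NumPy and OpenCV are available; the weighting scheme and parameter names are illustrative rather than the disclosure's exact blur filter 7424.

```python
import numpy as np
import cv2  # OpenCV, assumed available for Gaussian filtering

def depth_of_field_blur(image, depth_map, focus_uv, max_sigma=8.0, levels=5):
    """Blur pixels in proportion to their depth distance from the focal depth.

    image: (H, W, 3) float array; depth_map: (H, W) per-pixel depth values;
    focus_uv: (row, col) of the two-dimensional focal position.
    """
    focus_depth = depth_map[focus_uv]                       # depth value at the focal point
    diff = np.abs(depth_map - focus_depth)
    weight = np.clip(diff / (diff.max() + 1e-8), 0.0, 1.0)  # 0 = in focus, 1 = max blur

    # Pre-compute a small stack of increasingly blurred copies of the image.
    stack = [image]
    for i in range(1, levels):
        sigma = max_sigma * i / (levels - 1)
        stack.append(cv2.GaussianBlur(image, (0, 0), sigma))
    stack = np.stack(stack)                                 # (levels, H, W, 3)

    # Blend between adjacent blur levels per pixel.
    idx = weight * (levels - 1)
    lo, hi = np.floor(idx).astype(int), np.ceil(idx).astype(int)
    frac = (idx - lo)[..., None]
    rows, cols = np.indices(depth_map.shape)
    return (1 - frac) * stack[lo, rows, cols] + frac * stack[hi, rows, cols]
```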
In one or more embodiments, scene-based image editing system 106 also updates modified two-dimensional image 7410b in response to modified focus 7408. To illustrate, in response to modifying focus 7408 from two-dimensional position 7416 to an additional two-dimensional position, scene-based image editing system 106 utilizes two-dimensional image 7400 to generate an additional modified two-dimensional image. In particular, scene-based image editing system 106 determines additional two-dimensional locations based on a new three-dimensional location of focal point 7408 within a three-dimensional space that includes a three-dimensional representation of two-dimensional image 7400. For example, scene-based image editing system 106 determines an updated blur filter based on depth map 7420 and depth values of pixels corresponding to the updated focus. The scene based image editing system 106 utilizes the two-dimensional renderer 7422 to generate an updated two-dimensional image using the updated blur filter.
Fig. 75A-75E illustrate a graphical user interface of a client device for customizing a focus in a two-dimensional image. Specifically, as shown in fig. 75A, the client device displays a two-dimensional image 7500 for editing within the client application. For example, the client application includes an image editing application for editing a digital image through various editing operations. To illustrate, a client device provides a tool for editing a two-dimensional image through a three-dimensional representation of the two-dimensional image (e.g., a three-dimensional grid generated for the two-dimensional image).
In one or more embodiments, the client device displays a two-dimensional image 7500 for modifying the focus of the two-dimensional image 7500. For example, the scene-based image editing system 106 determines an intent to set or move a focus associated with the two-dimensional image 7500. To illustrate, the client device detects an input to indicate a location of a focus associated with a selected tool within the client application. Alternatively, the client device automatically infers an intent to indicate the focus position based on contextual information within the client application, such as user interaction with a portion of the two-dimensional image 7500 within the graphical user interface.
In conjunction with determining the focus of the two-dimensional image 7500, in at least some embodiments, the scene-based image editing system 106 determines input elements via a graphical user interface. Specifically, as previously described, the scene-based image editing system 106 determines the input elements from the position of the input relative to the two-dimensional image 7500 via the graphical user interface. Fig. 75B shows a graphical user interface of a modified two-dimensional image 7500a corresponding to the position of the input element 7502 a. For example, as shown, the client device displays the input element 7502a at a location within the modified two-dimensional image 7500a that corresponds to a three-dimensional location in three-dimensional space corresponding to the three-dimensional representation of the modified two-dimensional image 7500 a. To illustrate, the client device detects one or more inputs indicating a position of the input element 7502a within three-dimensional space.
In accordance with one or more embodiments, the scene-based image editing system 106 generates a three-dimensional object that corresponds to (or otherwise represents) the input element 7502a within a three-dimensional space. Specifically, as shown, the scene-based image editing system 106 generates a sphere of a predetermined size and inserts the sphere at a particular three-dimensional location within the three-dimensional space. For example, in connection with setting the focus for the modified two-dimensional image 7500a, the scene-based image editing system 106 inserts the sphere at a default or selected position in the three-dimensional space. Furthermore, the scene-based image editing system 106 displays the input element 7502a as a two-dimensional representation of the sphere based on the position of the sphere in three-dimensional space.
In response to determining the position of the input element 7502a (and corresponding three-dimensional object) within the three-dimensional space, the scene-based image editing system 106 determines a focus of the modified two-dimensional image 7500a. The scene-based image editing system 106 generates one or more portions of the modified two-dimensional image 7500a with focus blur from the position of the input element 7502 a. More specifically, the client device displays a modified two-dimensional image 7500a including one or more blurred portions within the graphical user interface in conjunction with the location of the input element 7502 a.
Although fig. 75B shows the client device displaying the input element 7502a, in an alternative embodiment, the client device hides the input element 7502a within the graphical user interface. For example, the scene-based image editing system 106 may hide the input element 7502a to avoid obscuring one or more portions of the modified two-dimensional image 7500a. Furthermore, in some embodiments, the scene-based image editing system 106 displays only a portion of the input element 7502a, or displays only a cursor or input location corresponding to an input device of the client device. To illustrate, the scene-based image editing system 106 displays the modified two-dimensional image 7500a without displaying the input element 7502a.
In one or more embodiments, the scene-based image editing system 106 also modifies the two-dimensional image based on changes in the position of the input element. Fig. 75C shows a client device displaying a modified two-dimensional image 7500b based on the updated position of the input element 7502b. In particular, scene-based image editing system 106 modifies the position of input element 7502b in response to user input moving input element 7502b from a first position to a second position. To illustrate, the scene-based image editing system 106 modifies the position of the input element 7502a of fig. 75B to the position of the input element 7502b of fig. 75C in response to input via the graphical user interface of the client device.
According to one or more embodiments, as shown, scene-based image editing system 106 modifies the blur of one or more portions of modified two-dimensional image 7500b based on the updated position of input element 7502 b. In particular, scene-based image editing system 106 determines movement of input element 7502b from a first position to a second position. Scene-based image editing system 106 determines one or more portions and blur values for the one or more portions based on the updated position of input element 7502 b.
Further, in one or more embodiments, the client device displays a transition in blur between locations of the input element. For example, when the scene-based image editing system 106 detects movement of an input element from a first location (e.g., the location of the input element 7502a of fig. 75B) to a second location (e.g., the location of the input element 7502b of fig. 75C), the scene-based image editing system 106 generates a plurality of modified two-dimensional images from the movement. The client device displays each modified two-dimensional image within the client application to provide a continuous transition (e.g., animation) of the blur effect within the scene. In an alternative embodiment, the client device updates the displayed two-dimensional image from the first location to the second location without displaying any intermediate transitions. In some embodiments, the client device displays the intermediate transitions based on a predetermined time or distance threshold associated with movement of the input element.
In at least some embodiments, the scene-based image editing system 106 modifies the focus of the two-dimensional image in response to an input element indicating a particular portion of the two-dimensional image. In particular, fig. 75D shows a client device displaying a two-dimensional image 7500c having a focus determined based on an input element 7504 including a cursor. For example, the client device utilizes an input device (e.g., a mouse device, a touchpad device, or a touch screen device) that indicates the location of the input element 7504 within the graphical user interface. To illustrate, in response to input element 7504 indicating selected object 7506, scene-based image editing system 106 generates a modified two-dimensional image by determining a focus based on the position of selected object 7506.
In one or more embodiments, scene-based image editing system 106 determines a focus within the three-dimensional representation of two-dimensional image 7500c based on the position of input element 7504. In particular, scene-based image editing system 106 determines that the location of input element 7504 corresponds to a point within the three-dimensional representation. For example, scene-based image editing system 106 determines the focus based on vertices of selected objects 7506 corresponding to the locations of input elements 7504. Alternatively, scene-based image editing system 106 determines the focus based on the center (e.g., centroid) of the three-dimensional grid corresponding to selected object 7506.
In response to determining the focus of the two-dimensional image 7500c associated with the selected object 7506, the scene-based image editing system 106 generates a modified two-dimensional image based on the indicated focus. In one or more embodiments, scene-based image editing system 106 utilizes the focal point to further modify two-dimensional image 7500c. In particular, scene-based image editing system 106 modifies two-dimensional image 7500c by zooming in on selected object 7506. For example, fig. 75E shows a client device displaying an enlarged two-dimensional image 7508 according to a selected object 7506a.
More specifically, in response to the indication of the selected object 7506, the scene-based image editing system 106 generates an enlarged two-dimensional image 7508 by modifying the focus of the two-dimensional image 7500c of fig. 75D and one or more camera parameters. For example, as shown in fig. 75E, the scene-based image editing system 106 generates an enlarged two-dimensional image 7508 by modifying the camera position from the original position of the two-dimensional image to an updated position corresponding to the selected object 7506a. To illustrate, the scene-based image editing system 106 determines the boundaries of the selected object 7506a and moves the camera position within a three-dimensional space comprising a three-dimensional representation of a two-dimensional image to zoom in on the selected object 7506a while capturing the boundaries of the selected object 7506a. In some embodiments, the scene-based image editing system 106 determines the camera position based on a predetermined distance from the focal point in three-dimensional space.
In additional embodiments, the scene-based image editing system 106 also modifies one or more additional parameters of the camera within the three-dimensional space. For example, the scene-based image editing system 106 modifies the field of view, focal length, or other parameters of the camera based on the updated position and focus of the camera. Thus, in one or more embodiments, scene-based image editing system 106 generates enlarged two-dimensional image 7508 based on the new focus and updated parameters of the camera.
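A rough sketch of how a camera could be repositioned to frame the selected object is shown below. It assumes an axis-aligned bounding box and a vertical field-of-view parameter; the function and parameter names are hypothetical, not the disclosure's camera model.

```python
import numpy as np

def zoom_camera_to_object(bbox_min, bbox_max, fov_y_deg=50.0, margin=1.1):
    """Place the camera so the selected object's bounding box fills the view.

    bbox_min, bbox_max: (3,) corners of the object's axis-aligned bounding box.
    Returns the new focal target (box center) and the camera distance along the view axis.
    """
    center = (np.asarray(bbox_min) + np.asarray(bbox_max)) / 2.0
    radius = np.linalg.norm(np.asarray(bbox_max) - np.asarray(bbox_min)) / 2.0
    fov_y = np.radians(fov_y_deg)
    # Distance at which a sphere of `radius` spans the vertical field of view.
    distance = margin * radius / np.tan(fov_y / 2.0)
    return center, distance

center, dist = zoom_camera_to_object([-0.4, 0.0, 2.0], [0.4, 1.8, 2.6])
print(center, dist)
```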
In one or more embodiments, the scene-based image editing system 106 provides various input elements for indicating the focus of a two-dimensional image. In particular, fig. 76A-76B illustrate a graphical user interface of a client device for indicating focus within a two-dimensional image. For example, FIG. 76A shows a modified two-dimensional image 7600 within a graphical user interface that includes a plurality of slider bars to indicate the focus of the modified two-dimensional image 7600. To illustrate, the first slider bar 7602a indicates the horizontal position of the focus, and the second slider bar 7602b indicates the vertical position of the focus. Alternatively, the second slider bar 7602b indicates a depth position of the focus instead of a vertical position.
Fig. 76B shows a modified two-dimensional image 7600a within the graphical user interface of the client device, including a region selector 7604 to indicate focus for the modified two-dimensional image 7600a. In particular, the scene-based image editing system 106 determines the location of the region selector 7604 within the modified two-dimensional image 7600a based on a portion of the modified two-dimensional image 7600a within the region selector 7604. To illustrate, the scene-based image editing system 106 determines the portion in response to determining that the portion occupies a majority of the area within the area selector 7604. Alternatively, scene-based image editing system 106 determines the portion in response to region selector 7604 performing a selection operation that "repairs" the portion at region selector 7604 (e.g., marks pixels corresponding to the portion) based on the depth of the portion. The scene based image editing system 106 generates a modified two-dimensional image 7600a by setting a focus according to the indicated portion.
In additional embodiments, the scene-based image editing system 106 provides a tool for performing additional operations within a two-dimensional image based on depth information of a three-dimensional representation of the two-dimensional image. For example, in some embodiments, the scene-based image editing system 106 provides a tool for selecting regions within a two-dimensional image based on three-dimensional depth values from a three-dimensional representation of the two-dimensional image. Fig. 77A-77C illustrate a graphical user interface of a client device for selecting different regions of a two-dimensional image based on three-dimensional depth values and input elements.
FIG. 77A illustrates a graphical user interface for selecting a portion of a two-dimensional image 7700 via a position of an input element 7702 within the graphical user interface. Specifically, the scene-based image editing system 106 determines the first portion 7704 of the two-dimensional image 7700 by converting the position of the input element 7702 into a three-dimensional position within a three-dimensional space corresponding to the three-dimensional representation of the two-dimensional image 7700. Further, the scene-based image editing system 106 determines the first portion 7704 based on the three-dimensional depth of the three-dimensional location and the corresponding three-dimensional depths of one or more other portions of the three-dimensional representation. Thus, as shown, the client device displays a selection of the first portion 7704 of the two-dimensional image 7700, including the portions of the three-dimensional representation having depth values similar to the depth at the three-dimensional position of the input element 7702.
Fig. 77B illustrates a graphical user interface for selecting a second portion 7704a of a two-dimensional image 7700 a. In particular, in response to determining that the input element 7702a moved to a new position within the graphical user interface, the scene-based image editing system 106 determines an updated three-dimensional position relative to the three-dimensional representation of the two-dimensional image 7700 a. The scene-based image editing system 106 also determines depth values for one or more portions of the three-dimensional representation that are similar to the depth values of the updated three-dimensional position of the input element 7702 a. Thus, the client device displays a second portion 7704a that is selected in response to the updated position of the input element 7702 a. As shown, moving the input element 7702a to a new position relative to the two-dimensional image 7700a changes a selected portion of the two-dimensional image 7700a according to the corresponding depth value.
In additional embodiments, the scene-based image editing system 106 provides an option for customizing the size of a selection based on the depth of content in a two-dimensional image. For example, the scene-based image editing system 106 provides selectable options that indicate a range of depth values for selection based on the input element. Alternatively, scene-based image editing system 106 modifies the range of depth values based on one or more additional inputs, such as a pinch-in or pinch-out gesture via a touch screen input, a scroll input via a mouse, or another type of input.
In particular, fig. 77C shows a graphical user interface including a two-dimensional image 7700b with a third portion 7704b selected based on an input element 7702b. To illustrate, in response to modifying the range of depth values via the parameters associated with the input element 7702b, the scene-based image editing system 106 reduces or increases the size of the selected region within the two-dimensional image 7700b. Thus, as shown, the third portion 7704b has a smaller range of depth value selections than the second portion 7704a of fig. 77B.
The scene-based image editing system 106 may also modify a portion of the digital image once the portion is selected. For example, although not shown in fig. 77A-77C, the scene-based image editing system 106 may segment, remove, replace, fill, or otherwise modify (e.g., change color, change shading, or resize) a portion of the digital image selected based on the methods described herein. To illustrate, the scene-based image editing system 106 may apply any local image modification to a portion of the content of the two-dimensional image based on the identified focus corresponding to the three-dimensional position of the input element. In one or more embodiments, the scene-based image editing system 106 applies an image filter (e.g., a color filter, illumination filter, or blur filter) to a portion of the two-dimensional image at a depth corresponding to the input element. In additional embodiments, the scene-based image editing system 106 may modify or transform an object or portion of an object located at a depth corresponding to an input element, such as by resizing or warping the object or portion of the object.
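The depth-range selection described above can be sketched as a simple mask over the estimated depth map. The snippet below is a minimal illustration, assuming a depth map is already available; the range parameter stands in for whatever pinch, scroll, or option input adjusts the selection size.

```python
import numpy as np

def select_by_depth(depth_map, click_uv, depth_range=0.5):
    """Return a boolean mask of pixels whose depth is close to the clicked pixel's depth.

    depth_map: (H, W) estimated depth values; click_uv: (row, col) of the input element;
    depth_range: half-width of the selected depth interval (shrunk or grown by the user).
    """
    target = depth_map[click_uv]
    return np.abs(depth_map - target) <= depth_range

# Example: widen or narrow the selection by scaling depth_range (e.g., pinch/scroll input).
depth = np.random.rand(480, 640) * 10.0
mask = select_by_depth(depth, (240, 320), depth_range=0.75)
```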
In one or more embodiments, the scene-based image editing system 106 also provides a tool for selecting particular objects detected within a two-dimensional image based on depth values. Fig. 78A-78C illustrate a graphical user interface of a client device for selecting objects based on depth within a three-dimensional representation of a two-dimensional image. For example, fig. 78A illustrates a two-dimensional image 7800 including a plurality of objects (e.g., people) in a two-dimensional scene. In conjunction with generating the three-dimensional representation of the two-dimensional image 7800, the scene-based image editing system 106 also determines three-dimensional depths of individual objects in the three-dimensional representation.
FIG. 78B illustrates that the scene-based image editing system 106 selects or highlights a portion of the two-dimensional image 7800a in response to an input (e.g., via the input element 7802 a). To illustrate, in response to determining the location of the input element 7802a, the scene-based image editing system 106 selects a first object 7804 within the two-dimensional image 7800a for highlighting within a graphical user interface of the client device based on a depth of a three-dimensional grid corresponding to the first object 7804. In some embodiments, in response to the input element 7802a being located at a determined depth corresponding to the first object 7804, the scene-based image editing system 106 selects the first object 7804. Alternatively, scene-based image editing system 106 selects first object 7804 based on the position of input element 7802a within the graphical user interface. For example, moving the input element 7802a in a particular direction within the graphical user interface causes the scene-based image editing system 106 to change the depth of the input element 7802a, thereby changing the selected object within the two-dimensional image 7800 a.
As an example, fig. 78C shows a two-dimensional image 7800b in response to moving an input element 7802b to a new position within the graphical user interface. To illustrate, in response to moving the input element 7802b, the scene-based image editing system 106 determines a second object 7804b corresponding to a depth of the input element 7802b and selects the second object 7804b. Thus, moving the input element 7802b in a particular direction (e.g., left to right or bottom to top) within the graphical user interface causes the scene-based image editing system 106 to cycle between objects within the two-dimensional image 7800b based on the depth values (e.g., closest to farthest) of the objects relative to the camera position. In some embodiments, although fig. 78B and 78C show input elements at different locations, scene-based image editing system 106 modifies the selection depth without displaying the input elements within the graphical user interface.
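One plausible way to implement this depth-ordered cycling is to sort the detected objects by the depth of their three-dimensional meshes and map the input movement onto that ordering. The sketch below is an assumption about that mapping, not the disclosure's exact interaction logic.

```python
def cycle_object_by_depth(object_depths, drag_fraction):
    """Map a normalized drag amount (0..1) to one of the detected objects, nearest to farthest.

    object_depths: dict mapping object id -> depth of its three-dimensional mesh from the camera.
    """
    ordered = sorted(object_depths, key=object_depths.get)          # nearest first
    index = min(int(drag_fraction * len(ordered)), len(ordered) - 1)
    return ordered[index]

# Dragging the input element a third of the way across the image selects the second-nearest object.
print(cycle_object_by_depth({"person_a": 2.1, "person_b": 3.4, "person_c": 5.0}, 0.34))
```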
Although fig. 77A-77C and 78A-78C and the corresponding descriptions indicate that the scene-based image editing system 106 may modify the two-dimensional image by reconstructing a three-dimensional representation of the two-dimensional image, the scene-based image editing system 106 may also modify the two-dimensional image via the corresponding depth map. In particular, as depicted in fig. 74C, the scene-based image editing system 106 may generate a depth map of the two-dimensional image to apply local image modifications to the two-dimensional image according to the focus indicated by the input element. For example, the scene-based image editing system 106 may utilize depth values of a depth map of a two-dimensional image to select one or more portions of the two-dimensional image and then apply one or more image filters or image transforms to the selected portion(s) of the two-dimensional image. Thus, the scene-based image editing system 106 may apply color, illumination, resizing, warping, blurring, pixelation, or other filters/transformations to a portion of the two-dimensional image selected based on the detected depth corresponding to the input element within the two-dimensional image.
As described above, the scene-based image editing system 106 generates a three-dimensional grid for editing a two-dimensional image. FIG. 79 illustrates an overview of a scene-based image editing system 106 editing two-dimensional images by modifying a corresponding three-dimensional grid in a three-dimensional environment. In particular, FIG. 79 illustrates that the depth displacement system generates a three-dimensional grid to represent the content of a two-dimensional image in three-dimensional space. FIG. 79 also shows that scene-based image editing system 106 utilizes a three-dimensional grid to modify a two-dimensional image.
In one or more embodiments, as shown in FIG. 79, the scene-based image editing system 106 identifies a two-dimensional image 7900. In one or more embodiments, the two-dimensional image 7900 includes a raster image. For example, the two-dimensional image 7900 includes a digital photograph of a scene including one or more objects at one or more locations relative to a viewpoint associated with the two-dimensional image 7900 (e.g., based on camera position). In additional embodiments, the two-dimensional image 7900 includes a rendered image (e.g., a digital representation of a hand-drawn image or a digital image generated via a computing device) that includes a plurality of objects having relative depths.
In accordance with one or more embodiments, the scene-based image editing system 106 generates a displaced three-dimensional grid 7902 representing a two-dimensional image 7900. In particular, scene-based image editing system 106 utilizes a plurality of neural networks to generate a displaced three-dimensional mesh 7902, including a plurality of vertices and faces forming a geometric representation of an object from two-dimensional image 7900. For example, the scene-based image editing system 106 generates a displaced three-dimensional grid 7902 to represent depth information and displacement information (e.g., relative positions of objects) from a two-dimensional image 7900 in three-dimensional space. Figures 80-85 and the corresponding description provide additional details regarding generating an adaptive three-dimensional grid for a two-dimensional image. In an alternative embodiment, scene-based image editing system 106 generates a displaced three-dimensional grid of the two-dimensional image based on the estimated pixel depth values and the estimated camera parameters-e.g., by determining a location of each vertex of the tessellation corresponding to the object in the two-dimensional image from the estimated pixel depth values and the estimated camera parameters.
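The alternative noted above (placing each tessellation vertex from estimated pixel depth values and estimated camera parameters) can be sketched as a standard back-projection through a pinhole intrinsic matrix. The names and conventions below are illustrative assumptions, not the disclosure's implementation.

```python
import numpy as np

def displace_vertices(uv, depth_map, K):
    """Lift 2D tessellation vertices into 3D using estimated depth and camera intrinsics.

    uv: (N, 2) vertex positions in pixel coordinates; depth_map: (H, W) estimated depths;
    K: (3, 3) estimated intrinsic matrix (focal lengths and principal point).
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    cols, rows = uv[:, 0], uv[:, 1]
    z = depth_map[rows.astype(int), cols.astype(int)]   # depth sampled at each vertex
    x = (cols - cx) / fx * z                            # back-project through the pinhole model
    y = (rows - cy) / fy * z
    return np.stack([x, y, z], axis=1)                  # (N, 3) displaced vertex positions
```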
In one or more embodiments, the neural network includes a computer representation that is adjusted (e.g., trained) based on inputs to approximate an unknown function. For example, a neural network includes one or more layers of artificial neurons that approximate unknown functions by analyzing known data at different levels of abstraction. In some embodiments, the neural network includes one or more neural network layers, including but not limited to convolutional neural networks, recurrent neural networks (e.g., LSTM), graph neural networks, or deep learning models. In one or more embodiments, the scene-based image editing system 106 utilizes one or more neural networks, including but not limited to semantic neural networks, object detection neural networks, density estimation neural networks, depth estimation neural networks, or camera parameter estimation neural networks.
In additional embodiments, the scene-based image editing system 106 determines the modified three-dimensional grid 7904 in response to a displacement input. For example, in response to a displacement input for modifying the two-dimensional image 7900, the scene-based image editing system 106 modifies the displaced three-dimensional grid 7902 to generate a modified three-dimensional grid 7904. Thus, modified three-dimensional grid 7904 includes one or more modified portions based on the displacement input. Fig. 9-19B and the corresponding description provide additional details regarding modifying a three-dimensional grid based on displacement input.
FIG. 79 also shows that scene-based image editing system 106 generates modified two-dimensional image 7906 based on modified three-dimensional grid 7904. In particular, scene-based image editing system 106 generates modified two-dimensional image 7906 to include a modified portion of modified three-dimensional grid 7904. To illustrate, the scene-based image editing system 106 utilizes a mapping of the two-dimensional image 7900 to the displaced three-dimensional grid 7902 to reconstruct a modified two-dimensional image 7906 based on the modified three-dimensional grid 7904.
FIG. 80 shows a schematic diagram of a scene-based image editing system 106 generating a three-dimensional grid including depth displacement information from a two-dimensional image. Specifically, the scene-based image editing system 106 generates a three-dimensional grid by extracting depth information associated with objects in the two-dimensional image 8000. Further, the scene-based image editing system 106 extracts displacement information indicative of the relative position of objects in the two-dimensional image 8000 (e.g., from estimated camera parameters associated with the viewpoint of the two-dimensional image 8000). As shown in fig. 80, the scene-based image editing system 106 utilizes depth information and displacement information to generate a three-dimensional grid representing a two-dimensional image 8000.
In one or more embodiments, scene-based image editing system 106 determines disparity estimation map 8002 based on two-dimensional image 8000. For example, the scene-based image editing system 106 utilizes one or more neural networks to determine disparity estimates corresponding to pixels in the two-dimensional image 8000. To illustrate, the scene-based image editing system 106 utilizes a disparity estimation neural network (or other depth estimation neural network) to estimate depth values corresponding to pixels of the two-dimensional image 8000. More specifically, a depth value indicates, for each pixel in the image, the relative distance of that pixel from the camera viewpoint associated with the image. In one or more embodiments, the depth values include (or are based on) the disparity estimates for the pixels of the two-dimensional image 8000.
In particular, the scene-based image editing system 106 utilizes the neural network(s) to estimate the depth value of each pixel based on where the objects within the two-dimensional image 8000 are placed in the given scene (e.g., how far in the foreground/background each pixel is located). The scene-based image editing system 106 may utilize various depth estimation models to estimate the depth value of each pixel. For example, in one or more embodiments, scene-based image editing system 106 utilizes a depth estimation neural network as described in U.S. application Ser. No. 17/186,436, filed on Feb. 26, 2021, entitled "GENERATING DEPTH IMAGES UTILIZING A MACHINE-LEARNING MODEL BUILT FROM MIXED DIGITAL IMAGE SOURCES AND MULTIPLE LOSS FUNCTION SETS," which is incorporated herein by reference in its entirety. The scene-based image editing system 106 alternatively utilizes one or more other neural networks to estimate depth values associated with pixels of the two-dimensional image 8000.
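The specific depth network is left open here; as a loose illustration, the snippet below only shows how per-pixel disparity estimates produced by a hypothetical `disparity_model(image)` call could be converted into relative depth values, under the common assumption that depth is roughly proportional to inverse disparity.

```python
import numpy as np

def disparity_to_depth(disparity, eps=1e-6):
    """Convert per-pixel disparity estimates into relative depth values.

    `disparity` is assumed to come from a monocular disparity-estimation network
    (a hypothetical `disparity_model(image)` call); larger disparity = closer to the camera.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    # Normalize to [0, 1], then invert: depth is proportional to 1 / disparity.
    norm = (disparity - disparity.min()) / (disparity.max() - disparity.min() + eps)
    return 1.0 / (norm + eps)
```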
As shown in fig. 80, in one or more embodiments, scene-based image editing system 106 also determines density map 8004 based on disparity estimation map 8002. Specifically, the scene-based image editing system 106 utilizes a set of filters to extract the density map 8004 from the disparity estimation map 8002. For example, the scene-based image editing system 106 utilizes the set of filters to determine the change in the rate of depth change (e.g., the second derivative) of the disparity estimation map 8002, which indicates the instantaneous rate of change in depth at each pixel of the two-dimensional image 8000. Thus, the density map 8004 includes information indicating the density of detail in the two-dimensional image 8000, with the highest density generally at the edges of objects, where depth changes fastest, and lower density in other areas, such as planar areas (e.g., sky or road) without significant detail.
FIG. 80 also shows that scene-based image editing system 106 determines sample points 8006 of two-dimensional image 8000 based on density map 8004. For example, scene-based image editing system 106 determines a set of points to sample associated with two-dimensional image 8000 based on density values in density map 8004. To illustrate, the scene-based image editing system 106 utilizes a sampling model that samples higher density points in higher density locations indicated by the density map 8004 (e.g., samples using a probability function that reflects the density of points). Thus, the scene-based image editing system 106 samples a higher number of points at locations where the two-dimensional image 8000 includes the greatest amount of depth information.
In response to determining the sample points 8006, scene-based image editing system 106 generates a tessellation 8008. Specifically, scene-based image editing system 106 generates an initial three-dimensional grid based on the sample points 8006. For example, scene-based image editing system 106 utilizes Delaunay triangulation to generate the tessellation 8008 from Voronoi cells corresponding to the sample points 8006. Thus, scene-based image editing system 106 generates a planar three-dimensional grid that includes vertices and faces with higher densities at portions where the sample point densities are higher.
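A minimal sketch of this sampling-and-triangulation step is shown below, assuming SciPy is available and that `density` is a non-negative map derived from the disparity second derivatives described in relation to fig. 81. The corner handling and parameter names are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import Delaunay

def sample_and_tessellate(density, num_points=5000, seed=0):
    """Sample image points in proportion to the density map, then Delaunay-triangulate them.

    density: (H, W) non-negative map; higher values receive more sample points.
    Returns (N, 2) sampled (col, row) points and the resulting planar triangulation.
    """
    h, w = density.shape
    probs = density.ravel() / density.sum()
    rng = np.random.default_rng(seed)
    flat_idx = rng.choice(h * w, size=num_points, replace=False, p=probs)
    rows, cols = np.unravel_index(flat_idx, (h, w))

    # Always include the image corners so the mesh covers the full frame.
    corners = np.array([[0, 0], [w - 1, 0], [0, h - 1], [w - 1, h - 1]])
    points = np.vstack([np.stack([cols, rows], axis=1), corners])

    return points, Delaunay(points)
```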
As shown in fig. 80, scene-based image editing system 106 also generates a displaced three-dimensional grid 8010 based on the tessellation 8008 of two-dimensional image 8000. In particular, the scene-based image editing system 106 utilizes one or more neural networks to determine the perspective or viewpoint associated with the two-dimensional image 8000. The scene-based image editing system 106 generates the displaced three-dimensional grid 8010 by merging depth displacement information indicating the relative positions of objects in the two-dimensional image 8000 from the perspective/viewpoint extracted from the two-dimensional image 8000. Thus, the scene-based image editing system 106 converts the planar three-dimensional mesh into a displaced three-dimensional mesh by modifying the locations of vertices in the mesh.
FIG. 81 illustrates additional details associated with determining a density map associated with a two-dimensional image. In particular, FIG. 81 shows that scene-based image editing system 106 applies a plurality of filters in relation to depth values extracted from two-dimensional image 8100. For example, scene-based image editing system 106 applies a filter to the disparity estimation map associated with two-dimensional image 8100. Alternatively, scene-based image editing system 106 applies a filter to other depth values associated with two-dimensional image 8100.
As shown in fig. 81, the scene-based image editing system 106 determines a hessian absolute value map 8102 based on a disparity estimation map of a two-dimensional image 8100 using a first filter. Specifically, the scene-based image editing system 106 utilizes a filter to generate a hessian matrix based on the disparity estimates. For example, the scene-based image editing system 106 generates the hessian absolute value map 8102 from a hessian matrix that indicates the second derivatives of the disparity estimation values, i.e., the change in the rate of change of the depth information from the two-dimensional image 8100. To illustrate, the scene-based image editing system 106 generates the hessian absolute value map 8102 by determining absolute values of the diagonal of the hessian matrix.
Further, as shown in fig. 81, the scene-based image editing system 106 applies a second filter to the hessian absolute value map 8102 to determine a smoothed value map 8104. For example, scene-based image editing system 106 modifies the absolute values in hessian absolute value map 8102 by smoothing the absolute values. To illustrate, the scene-based image editing system 106 utilizes convolution operations to generate a smoothed value map 8104 that includes smoothed values from the hessian absolute value map 8102. In some embodiments, by smoothing the values from the hessian absolute value map 8102, the scene-based image editing system 106 removes noise that may be introduced by determining the hessian matrix.
In one or more embodiments, scene-based image editing system 106 also modifies smoothed value map 8104 to determine density map 8106. Specifically, as shown in fig. 81, the scene-based image editing system 106 generates a density map 8106 by truncating (or clipping) values in the smoothed value map 8104 according to a predetermined threshold. For example, the scene-based image editing system 106 clips the values in the smoothed value map 8104 to a predetermined proportion of the standard deviation of the values (e.g., to 0.5 times the standard deviation). By truncating the values, scene-based image editing system 106 prevents large local changes in disparity from dominating the density of values in density map 8106.
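The three filtering steps above (absolute Hessian diagonal, smoothing, truncation) can be sketched roughly as follows. The exact kernels are not specified in the text, so the finite-difference second derivatives, Gaussian smoothing, and final normalization below are assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map_from_disparity(disparity, smooth_sigma=3.0, clip_factor=0.5):
    """Build a sampling-density map from a disparity map via its second derivatives."""
    # Diagonal of the Hessian: second derivatives along rows and columns.
    d_yy = np.gradient(np.gradient(disparity, axis=0), axis=0)
    d_xx = np.gradient(np.gradient(disparity, axis=1), axis=1)
    hessian_abs = np.abs(d_xx) + np.abs(d_yy)

    # Smooth to suppress noise introduced by differentiation.
    smoothed = gaussian_filter(hessian_abs, sigma=smooth_sigma)

    # Truncate so large local disparity changes do not dominate the density.
    clipped = np.minimum(smoothed, clip_factor * smoothed.std())

    # Normalize so the map can be used as a sampling probability distribution.
    return clipped / clipped.sum()
```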
In accordance with one or more embodiments, as illustrated, the density map 8106 includes higher density values at object boundaries and lower density values within the object boundaries of the two-dimensional image 8100. Further, the density map 8106 includes high density values for pixels within an object that correspond to sharp transitions in depth (e.g., at the edges of windows of the building of fig. 81), while limiting the density values in other areas that do not correspond to sharp transitions in depth (e.g., between individual leaves or clusters of leaves in the tree of fig. 81). Thus, the scene-based image editing system 106 generates the density map 8106 to indicate regions of the two-dimensional image 8100 for sampling points such that the sampling points reflect the regions where the depth information changes most rapidly.
In one or more embodiments, scene-based image editing system 106 utilizes multiple filters with customizable parameters to determine density map 8106. For example, the filter may include parameters that provide a manually customizable density region, such as an edge of an image, to provide higher point sampling at the indicated region. In one or more additional embodiments, the scene-based image editing system 106 customizes the clipping threshold to include regions with higher or lower information densities, which may serve a particular implementation.
In one or more embodiments, scene-based image editing system 106 samples points of a two-dimensional image based on density values corresponding to pixels in the two-dimensional image. Specifically, as shown in fig. 82, the depth displacement system samples points according to density values to sample a greater number of points in dense areas and a lesser number of points in less dense areas. In accordance with one or more embodiments, the scene-based image editing system 106 utilizes a sampling model that determines random sampling 8200 from density values in a density map (e.g., by using the density map as a probability distribution of samples). To illustrate, the scene-based image editing system 106 randomly samples a plurality of points using a density map, thereby producing random sampling points with higher density sampling points in a high density region of a two-dimensional image.
In one or more alternative embodiments, scene-based image editing system 106 utilizes a sampling model that uses a density map as a probability distribution in an iterative sampling process. In particular, rather than randomly sampling points according to density values, the scene-based image editing system 106 utilizes a sampling model that provides iterative movement of sampling locations, which results in a more uniform, better-formed triangulation in a three-dimensional grid generated based on the sampling points. For example, the scene-based image editing system 106 utilizes a sampling model with a relaxation model to iteratively move sampling points, in conjunction with Delaunay triangulation, toward the centers of the corresponding Voronoi cells. To illustrate, the scene-based image editing system 106 utilizes a sampling model (e.g., Lloyd's algorithm) with Voronoi iteration/relaxation that generates a centroidal Voronoi tessellation, where the seed point of each Voronoi cell/region is also its centroid. More specifically, the scene-based image editing system 106 repeatedly moves each sample point toward the centroid of the corresponding Voronoi cell.
Thus, in one or more embodiments, the scene-based image editing system 106 determines a first sampling iteration 8202 comprising a plurality of sampling points from a density map of a two-dimensional image. Furthermore, in one or more embodiments, scene-based image editing system 106 performs a plurality of iterations to further increase the regularity of the sampling from the density map of the two-dimensional image. Fig. 82 also shows that scene-based image editing system 106 determines a third sampling iteration 8204 that includes multiple sampling points after three sampling iterations. The three-dimensional mesh generated from the third sampling iteration 8204 includes more vertices and faces based on points sampled from the density map.
Fig. 82 also shows the 100th sampling iteration 8206 after 100 sampling iterations. As shown, continuing to perform sampling iterations beyond a certain point may weaken the correspondence between the sampling points (and the resulting three-dimensional grid) and the density map. Thus, in one or more embodiments, scene-based image editing system 106 determines the number of iterations based on the distance of the sampling points from the density map. Furthermore, in some embodiments, scene-based image editing system 106 determines the number of iterations based on a resource/time budget or a resolution of the two-dimensional image. To illustrate, the scene-based image editing system 106 determines that two or three iterations provide a plurality of sampling points that produce a three-dimensional grid that preserves the boundaries of the objects of the two-dimensional image while maintaining consistency with the density map.
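A common discrete approximation of this Voronoi relaxation assigns every pixel to its nearest sampling point and moves each point to the density-weighted centroid of its cell. The sketch below, using NumPy and SciPy, illustrates that idea under those assumptions and is not the disclosure's exact construction.

```python
import numpy as np
from scipy.spatial import cKDTree

def lloyd_relaxation(points, density, iterations=3):
    """Density-weighted Lloyd/Voronoi relaxation on a pixel grid.

    Each pixel is assigned to its nearest sampling point (a discrete Voronoi
    cell); each point is then moved to the density-weighted centroid of its
    cell. A small number of iterations regularizes the layout while staying
    consistent with the density map."""
    h, w = density.shape
    rows, cols = np.mgrid[0:h, 0:w]
    pixels = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(np.float64)
    weights = density.ravel()
    points = points.copy()
    for _ in range(iterations):
        labels = cKDTree(points).query(pixels)[1]  # nearest sampling point per pixel
        for i in range(len(points)):
            cell = labels == i
            cell_weight = weights[cell]
            if cell_weight.sum() > 0:
                points[i] = (pixels[cell] * cell_weight[:, None]).sum(0) / cell_weight.sum()
    return points
```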
In one or more embodiments, scene-based image editing system 106 also utilizes image-aware sampling to ensure that scene-based image editing system 106 samples all portions of a two-dimensional image to generate a three-dimensional grid. For example, the scene-based image editing system 106 considers portions with little or no detail at edges or corners of the two-dimensional image to ensure that the resulting three-dimensional grid includes edges/corners in the three-dimensional grid. To illustrate, the scene-based image editing system 106 provides instructions to the sampling model to sample at least some points along an edge of the two-dimensional image based on dimensions/coordinates of the two-dimensional image (e.g., by adding density to image boundaries). Alternatively, the scene-based image editing system 106 provides a tool for a user to manually indicate sampling points during the generation of a three-dimensional grid representing a two-dimensional image.
FIG. 83 illustrates the scene-based image editing system 106 generating a three-dimensional grid including depth displacement information for two-dimensional image content. Specifically, fig. 83 shows that scene-based image editing system 106 determines sampling points 8300 (e.g., as described in fig. 82). Further, fig. 83 shows that scene-based image editing system 106 generates a tessellation 8302 based on the sampling points 8300. To illustrate, the scene-based image editing system 106 determines the sampling points 8300 and generates the tessellation 8302 in an iterative process that utilizes Voronoi relaxation and Delaunay triangulation.
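For illustration, the planar tessellation can be built from the relaxed sampling points with a standard Delaunay triangulation, as in the following SciPy sketch; the function name and the (col, row) ordering are assumptions.

```python
import numpy as np
from scipy.spatial import Delaunay

def triangulate_samples(points):
    """Build a planar triangulated mesh over the relaxed sampling points.

    `points` is an (N, 2) array of (row, col) coordinates; `tri.simplices`
    holds the triangle faces and `tri.points` the 2D vertices of the mesh."""
    tri = Delaunay(points[:, ::-1])  # reorder to (x, y) = (col, row)
    return tri
```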
In one or more embodiments, scene-based image editing system 106 modifies the tessellation 8302 (a planar mesh including vertices and faces) to include displacement information based on the viewpoint of the two-dimensional image 8303. For example, the scene-based image editing system 106 determines the perspective associated with the two-dimensional image 8303 (e.g., based on the camera that captured the two-dimensional image). By determining the viewpoint and the corresponding displacement, the scene-based image editing system 106 incorporates depth information into the three-dimensional grid representing the two-dimensional image.
According to one or more embodiments, the scene-based image editing system 106 utilizes a neural network 8304 to estimate camera parameters 8306 associated with a viewpoint based on the two-dimensional image 8303. For example, the scene-based image editing system 106 utilizes a camera parameter estimation neural network to generate an estimated position, an estimated orientation, and/or an estimated focal length associated with the two-dimensional image 8303. To illustrate, the scene-based image editing system 106 utilizes a neural network as described in U.S. patent No. 11,094,083, entitled "UTILIZING A CRITICAL EDGE DETECTION NEURAL NETWORK AND A GEOMETRIC MODEL TO DETERMINE CAMERA PARAMETERS FROM A SINGLE DIGITAL IMAGE," filed on January 25, 2019, which is incorporated herein by reference in its entirety. In additional embodiments, the scene-based image editing system 106 extracts one or more camera parameters from metadata associated with the two-dimensional image 8303.
As shown in fig. 83, the scene-based image editing system 106 utilizes the camera parameters 8306 to generate a displaced three-dimensional grid 8308. In particular, the scene-based image editing system 106 utilizes the camera parameters 8306 to estimate the positions of the vertices of the tessellation 8302 from the depth values of the corresponding pixels of the two-dimensional image in relation to the position of the camera, the focal length of the camera, and/or the orientation of the camera. To illustrate, the scene-based image editing system 106 modifies the three-dimensional positions of the plurality of vertices and faces in three-dimensional space based on the relative positions of the objects in the two-dimensional image.
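One way to picture this displacement is to back-project each planar vertex through a pinhole camera model using the estimated focal length, principal point, and per-pixel depth. The sketch below makes those assumptions and uses illustrative names; it is not the disclosure's exact projection.

```python
import numpy as np

def displace_vertices(vertices_2d, depth, focal_length, cx, cy):
    """Lift planar mesh vertices into 3D using per-pixel depth values and a
    pinhole camera model (estimated focal length and principal point).

    `vertices_2d` is an (N, 2) array of (x, y) pixel coordinates; `depth` is
    an (H, W) array of estimated depth values."""
    xs = np.clip(np.round(vertices_2d[:, 0]).astype(int), 0, depth.shape[1] - 1)
    ys = np.clip(np.round(vertices_2d[:, 1]).astype(int), 0, depth.shape[0] - 1)
    z = depth[ys, xs]
    x3 = (vertices_2d[:, 0] - cx) * z / focal_length
    y3 = (vertices_2d[:, 1] - cy) * z / focal_length
    return np.stack([x3, y3, z], axis=1)
```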
Furthermore, in one or more embodiments, scene-based image editing system 106 utilizes additional information to further modify the three-dimensional grid of the two-dimensional image. In particular, scene-based image editing system 106 utilizes additional information from the two-dimensional image to determine the locations of vertices in the three-dimensional mesh. For example, as shown in fig. 84A-84B, the scene-based image editing system 106 utilizes additional edge information to modify the three-dimensional grid of the two-dimensional image.
For example, fig. 84A shows the scene-based image editing system 106 generating a displaced three-dimensional grid 8400 for a two-dimensional image using the process described above with reference to fig. 83. As shown, the displaced three-dimensional grid 8400 includes displacement information based on the viewpoint of the two-dimensional image, which may result in elongated/deformed portions of the three-dimensional grid at the edges of objects. To illustrate, certain displaced edges of objects in the displaced three-dimensional grid 8400 may lack detail because there are not enough polygons to accurately represent that detail.
In one or more embodiments, scene-based image editing system 106 adds additional detail to the three-dimensional grid (e.g., via additional vertices and faces). For example, the scene-based image editing system 106 provides color values (e.g., RGB values) from the two-dimensional image to a neural network that generates a displaced three-dimensional grid based on depth values and/or camera parameters. In particular, scene-based image editing system 106 utilizes the color values to further increase the density of polygons at the edges of the three-dimensional mesh to reduce artifacts and/or remove elongated polygons. FIG. 84B illustrates the scene-based image editing system 106 generating an additional displaced three-dimensional grid 8402 based on the additional information. As shown, the additional information allows the scene-based image editing system 106 to provide a higher quality displaced three-dimensional grid with more accurate detail at the edges of objects.
As shown in fig. 84B, the scene-based image editing system 106 utilizes an edge map 8404, the edge map 8404 including additional information about edges within the two-dimensional image. For example, the edge map 8404 includes edges based on an initial edge detection process that highlights particular edges that may not correspond to high density regions. To illustrate, the scene-based image editing system 106 identifies additional edges to sample during the grid generation process by utilizing filters that mimic a human tracing of edges in the two-dimensional image, a neural network that automatically detects certain edges, a refined edge detector model, semantic segmentation, or user input indicating corners/edges of a room, edges of planar objects such as paper, or edges of another object. By utilizing the edge map 8404 to guide the displacement of vertices in the additional displaced three-dimensional grid 8402, the scene-based image editing system 106 provides more accurate edge details in the additional displaced three-dimensional grid 8402 via additional vertices at the indicated edges. In other embodiments, the scene-based image editing system 106 also performs an edge detection operation on the disparity estimation map corresponding to the two-dimensional image to determine sampling locations in the two-dimensional image. Such processing allows the scene-based image editing system 106 to add arbitrary additional detail to the additional displaced three-dimensional grid 8402 based on additional information provided in connection with generating the additional displaced three-dimensional grid 8402.
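The effect of an edge map on the sampling density can be sketched by adding a normalized edge-strength term to the density map before sampling; the Sobel magnitude below is only a stand-in for the richer edge sources described above (learned detectors, semantic segmentation, user input), and the weighting is an assumption.

```python
import numpy as np
from scipy import ndimage

def add_edge_density(density, image_gray, edge_weight=0.5):
    """Boost sampling density along detected edges so the resulting mesh
    places extra vertices at object and image boundaries."""
    gx = ndimage.sobel(image_gray.astype(np.float64), axis=1)
    gy = ndimage.sobel(image_gray.astype(np.float64), axis=0)
    edges = np.hypot(gx, gy)
    if edges.max() > 0:
        edges /= edges.max()
    boosted = density + edge_weight * edges
    return boosted / boosted.sum()
```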
FIG. 85 also shows that scene-based image editing system 106 provides additional detail for generating a displaced three-dimensional grid for a two-dimensional image. For example, the scene-based image editing system 106 provides the user with one or more tools to indicate additional information to be added to the three-dimensional grid representing the two-dimensional image. Specifically, fig. 85 shows a two-dimensional image 8500 depicting an automobile parked on a road overlooking a landscape.
FIG. 85 also shows that user input has indicated a circle 8502 on the two-dimensional image 8500 for adding additional information to the displaced three-dimensional grid 8504 representing the two-dimensional image 8500. To illustrate, in response to a user input indicating circle 8502 on two-dimensional image 8500, scene-based image editing system 106 adds a circle to displaced three-dimensional grid 8504. For example, scene-based image editing system 106 adds additional vertices/faces to displacement three-dimensional grid 8504 at locations 8506 of displacement three-dimensional grid 8504 corresponding to circles 8502.
By adding additional information to the displaced three-dimensional grid 8504, the scene-based image editing system 106 provides additional flexibility in modifying the two-dimensional image 8500. For example, because scene-based image editing system 106 adds additional vertices/faces to displaced three-dimensional grid 8504 at location 8506, scene-based image editing system 106 provides the ability to modify selected portions without compromising the integrity of surrounding portions of displaced three-dimensional grid 8504. To illustrate, in response to a request to delete a portion of the two-dimensional image 8500 within the circle 8502, the scene-based image editing system 106 removes the corresponding portion of the displaced three-dimensional grid 8504 at the location 8506 of the displaced three-dimensional grid 8504. The scene-based image editing system 106 also provides additional options, such as deforming portions within the circle 8502 without compromising the geometry of portions of the displaced three-dimensional grid 8504 outside of the location 8506, or texturing portions within the circle 8502 separately from other portions of the two-dimensional image 8500.
Turning now to fig. 86, additional details regarding the various components and capabilities of the scene-based image editing system 106 will be provided. In particular, FIG. 86 illustrates a scene-based image editing system 106 implemented by a computing device 8600 (e.g., server(s) 102 and/or one of client devices 110a-110n). Furthermore, scene-based image editing system 106 is part of image editing system 104. As shown, in one or more embodiments, scene-based image editing system 106 includes, but is not limited to, grid generator 8602 (including neural network(s) 8604), user interface manager 8606, image depth manager 8608, object manager 8610, camera manager 8612, and data store 8614.
As shown in fig. 86, scene-based image editing system 106 includes grid generator 8602 to generate a three-dimensional grid from a two-dimensional image. For example, grid generator 8602 utilizes neural network(s) 8604 to estimate depth values for pixels of the two-dimensional image, and one or more filters to determine a density map based on the estimated depth values. In addition, grid generator 8602 samples points based on the density map and generates a tessellation based on the sampled points. Grid generator 8602 also generates (e.g., using neural network(s) 8604) a displaced three-dimensional grid by modifying the locations of vertices in the tessellation to incorporate depth and displacement information into the three-dimensional grid representing the two-dimensional image.
The scene-based image editing system 106 also includes a user interface manager 8606 to manage user interactions related to modifying the two-dimensional image via various tools. For example, the user interface manager 8606 detects the locations of inputs (e.g., input elements) relative to the two-dimensional image and converts the locations into a three-dimensional space associated with a corresponding three-dimensional grid. The user interface manager 8606 also converts the changes made to the three-dimensional grid back to a corresponding two-dimensional image for display within the graphical user interface. In further embodiments, the user interface manager 8606 displays user interface content related to editing two-dimensional images, such as planar surfaces.
In accordance with one or more embodiments, the scene-based image editing system 106 utilizes the image depth manager 8608 to determine and utilize depth information associated with a scene of a two-dimensional image to modify the two-dimensional image. For example, the image depth manager 8608 determines a three-dimensional location and/or three-dimensional depth corresponding to an input (e.g., an input element) and/or content within a three-dimensional space. In additional embodiments, the image depth manager 8608 utilizes a three-dimensional representation and/or a depth estimation operation to generate a depth map of the two-dimensional image. The image depth manager 8608 utilizes depth information to modify the two-dimensional image in accordance with the determined position/depth.
The scene based image editing system 106 utilizes the object manager 8610 to manage objects in the two-dimensional image and a three-dimensional representation of the two-dimensional image. For example, the object manager 8610 generates or otherwise determines a three-dimensional grid corresponding to an object in three-dimensional space relative to a two-dimensional image. The object manager 8610 communicates with the image depth manager 8608 to perform operations on objects according to object depth. The object manager 8610 also provides object information to one or more other components of the scene-based image editing system 106.
In one or more embodiments, the scene-based image editing system 106 utilizes the camera manager 8612 to manage camera parameters associated with the two-dimensional image. Specifically, the camera manager 8612 estimates camera parameters of a camera that captures a two-dimensional image. The camera manager 8612 also manages camera parameters of the camera in a three-dimensional space corresponding to the three-dimensional representation of the two-dimensional image. The camera manager 8612 manages parameters of the camera in the three-dimensional space for rendering the modified two-dimensional image, such as focus, focal length, position, rotation, and the like.
Further, as shown in fig. 86, the scene-based image editing system 106 includes a data store 8614. In particular, the data store 8614 includes data associated with modifying a two-dimensional image based on a three-dimensional representation of the two-dimensional image. For example, the data store 8614 includes a neural network for generating a three-dimensional representation of a two-dimensional image. The data storage 8614 also stores three-dimensional representations. The data store 8614 also stores information used by the scene-based image editing system 106 to modify the two-dimensional image based on three-dimensional characteristics of the content of the two-dimensional image, such as depth values, camera parameters, input elements, parameters of the object, or other information.
Each component of scene-based image editing system 106 of fig. 86 optionally includes software, hardware, or both. For example, a component includes one or more instructions stored on a computer-readable storage medium and executable by a processor of one or more computing devices, such as a client device or a server device. The computer-executable instructions of the scene-based image editing system 106, when executed by one or more processors, cause the computing device(s) to perform the methods described herein. Alternatively, these components include hardware, such as special purpose processing devices that perform a particular function or group of functions. Alternatively, components of scene-based image editing system 106 include a combination of computer-executable instructions and hardware.
Further, components of scene-based image editing system 106 may be implemented, for example, as one or more operating systems, one or more stand-alone applications, one or more modules of an application, one or more plug-ins, one or more library functions or functions that may be invoked by other applications, and/or a cloud computing model. Thus, the components of scene-based image editing system 106 may be implemented as a stand-alone application, such as a desktop or mobile application. Further, the components of the scene-based image editing system 106 may be implemented as one or more web-based applications hosted on a remote server. Alternatively or additionally, components of scene-based image editing system 106 may be implemented in a suite of mobile device applications or "apps." For example, in one or more embodiments, scene-based image editing system 106 includes or operates in connection with digital software applications of Adobe Inc. The foregoing are registered trademarks or trademarks of Adobe Inc. in the United States and/or other countries/regions.
Turning now to FIG. 87, a flow chart of a series of actions 8700 for modifying shadows in a two-dimensional image based on three-dimensional characteristics of the two-dimensional image is shown. While FIG. 87 illustrates acts in accordance with one embodiment, alternative embodiments may omit, add, reorder, and/or modify any of the acts illustrated in FIG. 87. The actions of FIG. 87 are part of a method. Alternatively, the non-transitory computer-readable medium includes instructions that, when executed by the one or more processors, cause the one or more processors to perform the actions of fig. 87. In yet another embodiment, the system includes a processor or server configured to perform the actions of FIG. 87.
As shown, the series of acts 8700 includes an act 8702 of generating a three-dimensional grid representing a two-dimensional image. Further, the series of actions 8700 includes an action 8704 of determining estimated three-dimensional characteristics of objects placed within the scene of the two-dimensional image based on the three-dimensional mesh. The series of acts 8700 further includes an act 8706, the act 8706 generating a modified two-dimensional image with updated shadows based on the location of the object.
In one or more embodiments, the series of actions 8700 includes determining, by at least one processor, estimated three-dimensional characteristics of one or more background objects in a scene of a two-dimensional image. The series of actions 8700 includes determining, by at least one processor, a request to place an object at a selected location within a scene of a two-dimensional image. The series of acts 8700 further includes generating, by the at least one processor, a modified two-dimensional image including the one or more updated shadows based on the selected locations of the objects and the estimated three-dimensional characteristics of the one or more background objects.
In one or more embodiments, the series of actions 8700 includes generating a three-dimensional grid for the two-dimensional image using one or more neural networks based on pixel depth values corresponding to one or more foreground objects and one or more background objects of the scene of the two-dimensional image and estimated camera parameters of a camera position corresponding to the two-dimensional image.
In one or more embodiments, the series of actions 8700 includes generating an object segmentation map for one or more foreground objects and one or more background objects of a scene of a two-dimensional image. The series of actions 8700 includes generating a plurality of separate three-dimensional meshes for one or more foreground objects and one or more background objects from the object segmentation map.
According to one or more embodiments, the series of actions 8700 includes determining that the request includes moving the object from a first location in the two-dimensional image to a second location in the two-dimensional image. The series of actions 8700 includes generating a proxy three-dimensional mesh at a three-dimensional location corresponding to a selected location within the two-dimensional scene from the shape of the object. The series of actions 8700 further includes removing a first shadow corresponding to the object at a first location in the two-dimensional image from the two-dimensional image. Further, the series of actions 8700 includes generating a second shadow corresponding to the object at a second location in the two-dimensional image using the proxy three-dimensional mesh.
In some embodiments, the series of actions 8700 includes determining an axis of symmetry corresponding to the object from features of a visible portion of the object within the two-dimensional image. The series of actions 8700 further includes generating a three-dimensional mesh based on the symmetry axis, including a first three-dimensional portion corresponding to the visible portion of the object and a mirrored three-dimensional portion of the first three-dimensional portion corresponding to the invisible portion of the object.
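A hedged sketch of the mirroring idea: reflecting the vertices of the visible-portion mesh across an estimated symmetry plane yields an approximation of the unseen side of the object. The plane parameters and the function name are illustrative assumptions rather than the disclosure's construction.

```python
import numpy as np

def mirror_across_plane(vertices, plane_point, plane_normal):
    """Reflect visible-portion mesh vertices across a symmetry plane to
    approximate the invisible portion of the object.

    `plane_point` is any point on the estimated symmetry plane and
    `plane_normal` its normal, both derived from features of the visible
    portion."""
    n = plane_normal / np.linalg.norm(plane_normal)
    signed_dist = (vertices - plane_point) @ n        # signed distance to the plane
    return vertices - 2.0 * signed_dist[:, None] * n  # mirrored copy of the vertices
```

A proxy mesh built this way would typically combine the original and mirrored vertices, with the face winding of the mirrored half reversed so surface normals remain consistent.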
In one or more embodiments, the series of actions 8700 includes determining that the object corresponds to a predetermined subset of objects. Further, the series of actions 8700 includes generating a three-dimensional mesh representing the object using a machine learning model trained for a predetermined subset of the object.
In at least some embodiments, the series of actions 8700 includes determining one or more shadow maps of the object at the selected location from a three-dimensional grid representing the object and one or more additional three-dimensional grids representing one or more background objects. The series of actions 8700 includes generating a modified two-dimensional image based on the one or more shadow maps, the estimated camera parameters of the two-dimensional image, and the estimated illumination parameters of the two-dimensional image, the modified two-dimensional image including rendered shadows of the object at the selected location.
In one or more embodiments, the series of actions 8700 includes generating a three-dimensional grid for a two-dimensional image using one or more neural networks based on pixel depth values corresponding to one or more background objects in a scene of the two-dimensional image. The series of actions 8700 further includes determining an estimated three-dimensional characteristic of the object relative to one or more background objects based on the three-dimensional grid of the two-dimensional image and in response to a request to place the object at a selected location within the scene of the two-dimensional image. The series of actions 8700 further includes generating a modified two-dimensional image including one or more updated shadows based on the selected locations of the objects and the estimated three-dimensional characteristics of the objects relative to the one or more background objects.
According to one or more embodiments, the series of actions 8700 includes determining object segmentation for one or more background objects in a scene of a two-dimensional image. The series of actions 8700 includes generating one or more three-dimensional meshes representing one or more background objects in three-dimensional space. The series of actions 8700 further includes generating a three-dimensional mesh representing the object in three-dimensional space.
In one or more embodiments, the series of actions 8700 includes determining a three-dimensional position based on a selected position within a scene of a two-dimensional image. The series of actions 8700 includes placing a three-dimensional mesh representing an object in three-dimensional space in a three-dimensional position.
In one or more embodiments, the series of actions 8700 includes determining that the object comprises a foreground object in a two-dimensional image. Further, the series of actions 8700 includes generating a proxy three-dimensional mesh that represents the object and is hidden within a graphical user interface that displays the two-dimensional image.
The series of actions 8700 includes determining an axis of symmetry of the object based on features of the object in the two-dimensional image. In one or more embodiments, the series of actions 8700 includes generating a proxy three-dimensional mesh representing the object by mirroring a portion of the three-dimensional mesh corresponding to the visible portion of the object on an axis of symmetry.
In one or more embodiments, the series of actions 8700 includes determining a portion of a three-dimensional grid at a three-dimensional position in three-dimensional space that corresponds to an initial position of an object in a two-dimensional image. The series of actions 8700 includes generating a replacement three-dimensional mesh portion at a three-dimensional location from estimated depth values of regions in three-dimensional space adjacent to the portion of the three-dimensional mesh using the smoothing model. The series of actions 8700 further includes generating an embedded region in the modified two-dimensional image using the neural network based on the background feature corresponding to the initial position of the object in the two-dimensional image.
In at least some embodiments, the series of actions 8700 includes determining that the request includes moving an object from a first location within a scene of the two-dimensional image to a second location within the scene of the two-dimensional image. Further, the series of actions 8700 includes generating a modified two-dimensional image by moving an object from a first location within a scene of the two-dimensional image to a second location within the scene of the two-dimensional image. The series of actions 8700 includes generating a modified two-dimensional image by removing a first shadow from the two-dimensional image that corresponds to the object at the first location. The series of actions 8700 includes generating a modified two-dimensional image for display within the modified two-dimensional image by generating a second shadow corresponding to the object at a second location based on the estimated three-dimensional characteristics of the object relative to the one or more background objects.
In one or more embodiments, the series of actions 8700 includes determining an estimated three-dimensional characteristic of one or more background objects in a scene of a two-dimensional image. The series of actions 8700 further includes determining a request to place the object at a selected location within the scene of the two-dimensional image in response to an input interacting with the two-dimensional image within the graphical user interface of the display device. Further, the series of actions 8700 includes generating a modified two-dimensional image based on the position of the object and the estimated three-dimensional characteristics of the one or more background objects for display within the graphical user interface in response to the request, the modified two-dimensional image including one or more updated shadows.
According to one or more embodiments, the series of actions 8700 includes determining an object segmentation map of a scene of the two-dimensional image based on pixel depth values of the two-dimensional image, the object segmentation map including a segmentation of the object and one or more additional segmentations of one or more background objects. Further, the series of actions 8700 includes generating a foreground three-dimensional grid corresponding to the object based on the segmentation of the object. The series of acts 8700 further includes generating one or more background three-dimensional meshes corresponding to the one or more background objects based on the one or more additional segmentations of the one or more background objects.
In one or more embodiments, the series of actions 8700 includes determining an estimated three-dimensional characteristic of the object relative to the one or more background objects including generating a proxy three-dimensional mesh corresponding to the object. The series of acts 8700 further includes generating a modified two-dimensional image including generating shadows for rendering within the modified two-dimensional image based on the proxy three-dimensional mesh corresponding to the object.
The series of actions 8700 further includes determining that the request includes moving the object from a first location within the scene of the two-dimensional image to a second location within the scene of the two-dimensional image. Further, the series of actions 8700 includes generating a modified two-dimensional image by removing a first shadow of the object at a first location and generating a second shadow of the object at a second location using a proxy three-dimensional grid representing the object according to the estimated three-dimensional characteristics of the one or more background objects.
Turning now to FIG. 88, a flow chart of a series of actions 8800 for modifying shadows in a two-dimensional image with multiple shadow maps of an object for the two-dimensional image is shown. Although FIG. 88 illustrates acts in accordance with one embodiment, alternative embodiments may omit, add, reorder, and/or modify any of the acts illustrated in FIG. 88. The actions of FIG. 88 are part of a method. Alternatively, the non-transitory computer-readable medium includes instructions that, when executed by the one or more processors, cause the one or more processors to perform the actions of fig. 88. In yet another embodiment, the system includes a processor or server configured to perform the actions of FIG. 88.
As shown, the series of actions 8800 includes an action 8802 of generating a shadow map corresponding to an object placed within a scene of a two-dimensional image. Further, the series of actions 8800 includes an action 8804 of generating an estimated shadow map of the two-dimensional image based on shadows detected in the two-dimensional image. The series of actions 8800 also includes an action 8806 of generating a modified two-dimensional image based on the shadow map and the estimated shadow map of the object.
In one or more embodiments, the series of actions 8800 includes generating, by at least one processor, a shadow map corresponding to the object from three-dimensional characteristics of the object in response to a request to place the object at a selected location within a scene of the two-dimensional image. In one or more embodiments, the series of actions 8800 includes generating, by at least one processor, an estimated shadow map for the two-dimensional image based on the one or more shadows detected in the two-dimensional image and the estimated camera parameters of the two-dimensional image. In accordance with one or more embodiments, the series of acts 8800 includes generating, by at least one processor, a modified two-dimensional image based on a shadow map corresponding to the object and an estimated shadow map of the two-dimensional image in connection with placing the object in a selected position.
In some embodiments, the series of actions 8800 includes placing a three-dimensional mesh corresponding to the object at a three-dimensional location within a three-dimensional space corresponding to a selected location within a scene of the two-dimensional image. The series of actions 8800 includes determining a shadow map corresponding to the object based on a three-dimensional position of the three-dimensional mesh in three-dimensional space.
The series of actions 8800 includes determining that the request includes moving an object from a first location within a scene of the two-dimensional image to a second location within the scene of the two-dimensional image. The series of actions 8800 further includes generating a proxy three-dimensional mesh representing the object from features of the object extracted from the two-dimensional image.
In some embodiments, the series of actions 8800 includes generating a shadow map based on the estimated camera position of the two-dimensional image, the estimated illumination parameters of the two-dimensional image, and a proxy three-dimensional grid at a three-dimensional position within the three-dimensional space.
The series of actions 8800 includes determining that the request includes importing an object into the two-dimensional image to place at the selected location. The series of actions 8800 further includes placing an imported three-dimensional mesh representing the object in a two-dimensional image at a three-dimensional location within the three-dimensional space. The series of actions 8800 further includes generating a shadow map based on the estimated camera position of the two-dimensional image and the imported three-dimensional mesh representing the object at the three-dimensional position within the three-dimensional space.
In one or more embodiments, the series of actions 8800 includes determining additional objects corresponding to the two-dimensional image. The series of actions 8800 further includes generating an additional shadow map corresponding to the additional object from the three-dimensional characteristics of the additional object in response to the additional object being a different object type than the object. Further, the series of actions 8800 includes generating a modified two-dimensional image based on the shadow map corresponding to the object, the estimated shadow map of the two-dimensional image, and the additional shadow map corresponding to the additional object.
According to one or more embodiments, the series of actions 8800 includes: based on the estimated camera parameters of the two-dimensional image, a relative position of the object and one or more additional objects corresponding to the two-dimensional image is determined from the three-dimensional characteristics of the object and the estimated three-dimensional characteristics of the one or more additional objects. The series of actions 8800 further includes merging the shadow map of the object with the estimated shadow map of the two-dimensional image based on the relative positions of the object and the one or more additional objects.
In some examples, the series of actions 8800 includes generating at least a partial shadow on an object within the modified two-dimensional image from one or more additional objects in the two-dimensional image or from a scene shadow detected in a scene of the two-dimensional image, based on a relative position of the object and the one or more additional objects corresponding to the two-dimensional image.
In one or more embodiments, the series of actions 8800 includes generating a first shadow map in response to a request to place an object at a selected location within a scene of a two-dimensional image, the first shadow map including a first shadow type corresponding to the object according to three-dimensional characteristics of the object at the selected location. In one or more embodiments, the series of actions 8800 includes generating a second shadow map including a second shadow type corresponding to the two-dimensional image based on the one or more shadows detected in the two-dimensional image and the estimated camera parameters of the two-dimensional image. In some embodiments, the series of actions 8800 includes generating a modified two-dimensional image including the object at the selected location by merging the first shadow map and the second shadow map in conjunction with the estimated camera parameters of the two-dimensional image.
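As an illustration of merging two shadow maps, the sketch below combines per-pixel shadow opacities multiplicatively so overlapping shadows are not double-darkened, then darkens the rendered image accordingly. The opacity representation, the 0.6 darkening factor, and the function names are assumptions, not the disclosure's renderer.

```python
import numpy as np

def merge_shadow_maps(object_shadow, scene_shadow):
    """Merge an object/proxy shadow map with the estimated scene shadow map.

    Both maps hold per-pixel shadow opacity in [0, 1]; the merged value darkens
    a pixel if either source casts shadow there, without double-darkening
    regions covered by both."""
    return 1.0 - (1.0 - np.clip(object_shadow, 0, 1)) * (1.0 - np.clip(scene_shadow, 0, 1))

def apply_shadows(image, merged_shadow, max_darkening=0.6):
    """Darken an image according to the merged shadow opacity."""
    factor = 1.0 - max_darkening * merged_shadow[..., None]
    return np.clip(image * factor, 0, 255).astype(image.dtype)
```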
In at least some embodiments, the series of actions 8800 includes determining that the object corresponds to an object type that includes a set of object properties. For example, the series of actions 8800 includes generating a modified two-dimensional image including one or more shadows according to the set of object properties of the object type.
In some embodiments, the series of actions 8800 includes generating a proxy three-dimensional grid of objects in response to determining that the request includes moving the objects from a first location within a scene of the two-dimensional image to a second location within the scene of the two-dimensional image. The series of actions 8800 includes generating a first shadow map based on the proxy three-dimensional mesh and the estimated camera position of the two-dimensional image.
According to one or more embodiments, the series of actions 8800 includes determining that a shadow portion of a scene from a two-dimensional image is projected onto an object based on a second shadow map of the two-dimensional image and a three-dimensional position of the object. The series of actions 8800 includes generating a modified two-dimensional image including a shadow portion from a scene of the two-dimensional image on the object at the selected location.
In one or more embodiments, the series of actions 8800 includes, in response to determining that the request includes inserting the object into the two-dimensional image at the selected location, inserting an imported three-dimensional mesh for the object. The series of actions 8800 includes generating a first shadow map based on the imported three-dimensional mesh and the estimated camera position of the two-dimensional image.
In some embodiments, the series of actions 8800 includes determining that the first shadow type includes a proxy shadow type related to a proxy three-dimensional mesh representing the object. The series of acts 8800 includes determining that the second shadow type includes a scene shadow type related to one or more objects that cast one or more shadows within a scene of the two-dimensional image.
The series of actions 8800 includes generating a third shadow map including a third shadow type corresponding to the additional object based on the three-dimensional characteristics of the additional object, the third shadow type including a different shadow type than the first shadow type and the second shadow type. The series of acts 8800 further includes generating a modified two-dimensional image by merging the third shadow map with the first shadow map and the second shadow map.
According to one or more embodiments, the series of actions 8800 includes generating a foreground shadow map corresponding to a foreground object from three-dimensional characteristics of the foreground object in response to a request to place the foreground object at a selected location within a scene of a two-dimensional image. In some embodiments, the series of actions 8800 includes generating a background shadow map of the two-dimensional image based on the one or more shadows detected in the background of the two-dimensional image and the estimated camera parameters of the two-dimensional image. Further, the series of actions 8800 includes generating a modified two-dimensional image by combining the foreground shadow map and the background shadow map in connection with placing the foreground object in the selected position.
In some embodiments, the series of actions 8800 includes determining that the request includes moving a foreground object from a first location within a scene of the two-dimensional image to a second location within the scene of the two-dimensional image. The series of actions 8800 includes generating, based on the request, a proxy three-dimensional mesh representing the foreground object from features of the foreground object in the two-dimensional image. Further, the series of actions 8800 includes generating a foreground shadow map based on the proxy shadows corresponding to the proxy three-dimensional grid and the estimated camera locations of the two-dimensional image.
The series of actions 8800 includes generating a pixel depth map corresponding to pixels of the two-dimensional image. The series of actions 8800 further includes generating a background shadow map based on the one or more shadows detected in the background of the two-dimensional image, the pixel depth map, and the estimated camera parameters of the two-dimensional image.
In one or more embodiments, the series of actions 8800 includes determining a position of a particular object relative to a foreground object and a background of a two-dimensional image based on a three-dimensional representation of a scene of the two-dimensional image. The series of actions 8800 further includes determining, based on the position of the particular object relative to the foreground object and the background, that shadows of the foreground object or shadows detected in the background cover a portion of the particular object using the foreground shadow map and the background shadow map.
Turning now to fig. 89, a flow chart of a series of actions 8900 for generating a scale field indicative of a pixel-to-metric distance ratio of a two-dimensional image is shown. Although FIG. 89 illustrates acts in accordance with one embodiment, alternative embodiments may omit, add, reorder, and/or modify any of the acts illustrated in FIG. 89. The acts of fig. 89 are part of a method. Alternatively, the non-transitory computer-readable medium includes instructions that, when executed by the one or more processors, cause the one or more processors to perform the actions of fig. 89. In yet another embodiment, the system includes a processor or server configured to perform the actions of FIG. 89.
As shown, the series of actions 8900 includes an action 8902 of generating a feature representation of a two-dimensional image. Further, the series of actions 8900 includes an action 8904 of generating, based on the feature representation, a scale field for the two-dimensional image that includes values indicative of a pixel-to-metric distance ratio in the two-dimensional image. The series of actions 8900 also includes an action 8906 of generating a metric distance for the content from the scale field. Alternatively, the series of actions 8900 includes an action 8908 of modifying the two-dimensional image according to the scale field.
In one or more embodiments, the series of actions 8900 includes generating a feature representation of a two-dimensional image using one or more neural networks. In some embodiments, the series of actions 8900 includes generating a scale field of the two-dimensional image using the one or more neural networks and based on the feature representation, the scale field including a plurality of values indicative of a ratio of pixel distances in the two-dimensional image to metric distances in a three-dimensional space corresponding to the two-dimensional image. In additional embodiments, the series of actions 8900 includes performing at least one of generating, by the at least one processor, a metric distance for content depicted in the two-dimensional image from the scale field of the two-dimensional image or modifying, by the at least one processor, content in the two-dimensional image according to the scale field of the two-dimensional image.
In some embodiments, the series of actions 8900 includes generating a plurality of ground-to-horizon vectors in three-dimensional space from a horizon of a two-dimensional image in three-dimensional space using one or more neural networks and based on the feature representations. In further embodiments, the series of actions 8900 includes generating a ground-to-horizon vector that indicates a distance and direction from a three-dimensional point corresponding to a pixel of the two-dimensional image to a horizon in three-dimensional space.
According to one or more embodiments, the series of actions 8900 includes generating a ratio for pixels of the two-dimensional image that indicates a ratio of pixel distances in the two-dimensional image to corresponding three-dimensional distances in the three-dimensional space relative to a camera height of the two-dimensional image.
In at least some embodiments, the series of actions 8900 includes determining a pixel distance between a first pixel corresponding to the content and a second pixel corresponding to the content. The series of actions 8900 includes generating a metric distance based on the pixel distance and a ratio of the pixel distance in the two-dimensional image to the metric distance in the three-dimensional space.
The series of actions 8900 includes determining a value of a scale field corresponding to the first pixel. Further, the series of actions 8900 includes converting a value of the scale field corresponding to the first pixel to a metric distance based on a pixel distance between the first pixel and the second pixel.
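A minimal sketch of the pixel-to-metric conversion follows, assuming the scale field stores, per pixel, a ratio of image-space distance to metric distance so that dividing a measured pixel distance by the value at the first pixel yields meters; the exact definition used in the disclosure may differ, and the function name is illustrative.

```python
import numpy as np

def metric_distance(p1, p2, scale_field):
    """Convert a pixel distance into an approximate metric distance using the
    scale-field value at the first endpoint.

    `p1` and `p2` are (x, y) pixel coordinates; `scale_field` is an (H, W)
    array of pixel-to-metric ratios."""
    p1 = np.asarray(p1, dtype=np.float64)
    p2 = np.asarray(p2, dtype=np.float64)
    pixel_dist = np.linalg.norm(p1 - p2)
    ratio = scale_field[int(round(p1[1])), int(round(p1[0]))]  # value at the first pixel
    return float(pixel_dist / ratio)
```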
In an additional embodiment, the series of actions 8900 includes determining a pixel location of an object placed within the two-dimensional image and determining a scale of the object based on the pixel location and the scale field. The series of actions 8900 includes determining an initial size of the object and inserting the object into a pixel location having a modified size based on a ratio indicated by a value of a scale field at the pixel location from the object.
In one or more embodiments, the series of actions 8900 includes generating, for a two-dimensional image, estimated depth values for a plurality of pixels of the two-dimensional image projected into a corresponding three-dimensional space. The series of actions 8900 further includes determining a horizon for the two-dimensional image based on the estimated camera height of the two-dimensional image. The series of actions 8900 further includes generating a ground truth scale field for the two-dimensional image based on a plurality of ground-to-horizon vectors in the corresponding three-dimensional space according to the estimated depth values for the plurality of pixels and the horizon. Further, the series of actions 8900 includes modifying parameters of one or more neural networks based on the ground truth scale field of the two-dimensional image.
According to one or more embodiments, the series of actions 8900 includes, for a two-dimensional image of a plurality of two-dimensional images, generating estimated depth values for a plurality of pixels of the two-dimensional image projected into a three-dimensional space. The series of actions 8900 further includes generating a scale field for the two-dimensional image from a horizon of the two-dimensional image, the scale field including a plurality of values indicative of a ratio of ground-to-horizon vector lengths in three-dimensional space relative to pixel distances of the plurality of pixels of the two-dimensional image. Further, the series of actions 8900 includes modifying parameters of one or more neural networks based on the scale field of the two-dimensional image.
In one or more embodiments, the series of actions 8900 includes projecting a plurality of pixels of a two-dimensional image into a three-dimensional space using one or more neural networks. The series of actions 8900 also includes determining a horizon in three dimensions based on camera heights of the two-dimensional images. Further, the series of actions 8900 includes generating a plurality of ground-to-horizon vectors based on the estimated depth values, the plurality of ground-to-horizon vectors representing a plurality of metric distances between ground points corresponding to a plurality of pixels of the two-dimensional image and a horizon in three-dimensional space.
According to one or more embodiments, the series of actions 8900 includes determining a pixel distance between a first pixel of a plurality of pixels and a second pixel corresponding to a horizon of a two-dimensional image. The series of actions 8900 further includes generating a value for the first pixel that represents a ratio between a pixel distance corresponding to the two-dimensional image and a camera height.
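To make the ratio concrete, the sketch below builds a reference scale field from an annotated horizon row and camera height, assuming a roughly level camera so the horizon is a single image row; real training data would use per-pixel ground-to-horizon vectors rather than this simplification, and the function name is hypothetical.

```python
import numpy as np

def ground_truth_scale_field(height, width, horizon_row, camera_height_m):
    """Build a reference scale field from an annotated horizon line and camera
    height: each pixel below the horizon gets the ratio of its pixel distance
    to the horizon over the camera height (pixels per meter at that point)."""
    rows = np.arange(height, dtype=np.float64)[:, None]
    pixels_to_horizon = np.maximum(rows - horizon_row, 0.0)  # zero above the horizon
    field = pixels_to_horizon / camera_height_m
    return np.broadcast_to(field, (height, width)).copy()
```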
In some embodiments, the series of actions 8900 includes generating an estimated scale field of a two-dimensional image using one or more neural networks. The series of actions 8900 further includes determining a loss based on the scale field of the two-dimensional image and the estimated scale field of the two-dimensional image. The series of actions 8900 further includes modifying parameters of one or more neural networks based on the loss.
In one or more embodiments, the series of actions 8900 includes generating a feature representation of the additional two-dimensional image using one or more neural networks. The series of actions 8900 further includes generating additional scale fields for the additional two-dimensional image using one or more neural networks. In one or more embodiments, the series of actions 8900 includes modifying the additional two-dimensional image by placing an object within the additional two-dimensional image having an object size based on an additional scale field of the additional two-dimensional image. In some embodiments, the series of actions 8900 includes determining a metric distance for content depicted in the additional two-dimensional image from an additional scale field of the additional two-dimensional image.
In accordance with one or more embodiments, the series of actions 8900 includes generating a feature representation of a two-dimensional image using one or more neural networks including parameters learned from a plurality of digital images having annotated horizons. The series of actions 8900 further includes generating a scale field of the two-dimensional image using the one or more neural networks and based on the feature representation, the scale field including a plurality of values indicative of a ratio of the pixel distance relative to a camera height of the two-dimensional image. In some embodiments, the series of actions 8900 includes performing at least one of generating a metric distance of an object depicted in the two-dimensional image from the scale field of the two-dimensional image, or modifying the object in the two-dimensional image according to the scale field of the two-dimensional image.
In one or more embodiments, the series of actions 8900 includes generating, for a pixel of a two-dimensional image, a value representing a ratio between a pixel distance from the pixel to a horizon of the two-dimensional image and a camera height of the two-dimensional image.
The series of actions 8900 also includes determining a pixel corresponding to a location of the two-dimensional image. The series of actions 8900 further includes determining a scale size of the object based on values from a scale field of pixels corresponding to a location of the two-dimensional image. The series of actions 8900 further includes inserting the object at a location of the two-dimensional image according to a scaled size of the object.
Turning now to FIG. 90, a flow diagram of a series of acts 9000 of generating a three-dimensional human model of a two-dimensional human being in a two-dimensional image is shown. While FIG. 90 illustrates acts in accordance with one embodiment, alternative embodiments may omit, add, reorder, and/or modify any of the acts illustrated in FIG. 90. The acts of fig. 90 are part of a method. Alternatively, the non-transitory computer-readable medium includes instructions that, when executed by the one or more processors, cause the one or more processors to perform the actions of fig. 90. In yet another embodiment, the system includes a processor or server configured to perform the actions of FIG. 90.
As shown, the series of acts 9000 includes an act 9002 of extracting two-dimensional pose data from a two-dimensional human being in a two-dimensional image. Further, the series of actions 9000 includes an action 9004 of extracting three-dimensional pose data and three-dimensional shape data corresponding to a two-dimensional human being. The series of acts 9000 further includes an act 9006 of generating a three-dimensional human model representing a two-dimensional human being based on the two-dimensional data and the three-dimensional data.
In one or more embodiments, the series of actions 9000 includes extracting, using one or more neural networks, two-dimensional pose data from a two-dimensional human extracted from a two-dimensional image. In some embodiments, the series of actions 9000 includes extracting, using one or more neural networks, three-dimensional pose data and three-dimensional shape data corresponding to the two-dimensional human extracted from the two-dimensional image. In other embodiments, the series of actions 9000 includes generating a three-dimensional human model representing the two-dimensional human in a three-dimensional space corresponding to the two-dimensional image by combining the two-dimensional pose data with the three-dimensional pose data and the three-dimensional shape data.
According to one or more embodiments, the series of actions 9000 includes extracting, with a first neural network, two-dimensional pose data comprising a two-dimensional skeleton having two-dimensional bones and annotations indicative of one or more portions of the two-dimensional skeleton. The series of actions 9000 further includes extracting, with a second neural network and according to the two-dimensional human, three-dimensional pose data comprising a three-dimensional skeleton having three-dimensional bones and three-dimensional shape data comprising a three-dimensional mesh. The series of actions 9000 further includes extracting, with a third neural network, three-dimensional hand pose data corresponding to one or more hands of the two-dimensional human according to a hand-specific bounding box.
In one or more embodiments, the series of actions 9000 includes iteratively adjusting one or more bones in the three-dimensional pose data according to one or more corresponding bones in the two-dimensional pose data. In further embodiments, the series of actions 9000 comprises iteratively connecting one or more hand bones having three-dimensional hand pose data to a body bone having three-dimensional body pose data.
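The iterative adjustment can be pictured as repeatedly projecting the three-dimensional joints with estimated pinhole camera parameters and nudging them toward the detected two-dimensional keypoints. This gradient-free sketch, its step size, and its function name are illustrative assumptions rather than the disclosure's optimization.

```python
import numpy as np

def refine_joints(joints_3d, joints_2d, focal_length, cx, cy,
                  iterations=10, step=0.5):
    """Iteratively nudge 3D joint positions so their pinhole projection moves
    toward the detected 2D keypoints.

    Each iteration projects the joints, measures the 2D residual in pixels,
    and shifts each 3D joint laterally (at its current depth) by a fraction of
    that residual."""
    joints_3d = joints_3d.copy()
    for _ in range(iterations):
        z = joints_3d[:, 2:3]
        proj = joints_3d[:, :2] * focal_length / z + np.array([cx, cy])
        residual = joints_2d - proj                       # 2D error in pixels
        joints_3d[:, :2] += step * residual * z / focal_length
    return joints_3d
```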
In some embodiments, the series of actions 9000 includes generating a modified three-dimensional human model having modified three-dimensional pose data in response to an indication to modify a pose of a two-dimensional human within the two-dimensional image. The series of actions 9000 includes generating a modified two-dimensional image comprising a modified two-dimensional human based on the modified three-dimensional human model.
The series of actions 9000 includes determining interactions between the modified three-dimensional human model and additional three-dimensional models within the three-dimensional space corresponding to the two-dimensional image in response to the three-dimensional human model comprising modified three-dimensional pose data. The series of actions 9000 further includes generating a modified two-dimensional image comprising a modified two-dimensional human from interactions between the modified three-dimensional human model and the additional three-dimensional model.
In at least some embodiments, the series of actions 9000 includes generating a cropped image corresponding to a boundary of a two-dimensional human in the two-dimensional image. The series of actions 9000 includes extracting two-dimensional pose data from the cropped image using one or more neural networks.
In one or more embodiments, the series of actions 9000 includes extracting, with one or more neural networks, two-dimensional pose data corresponding to a two-dimensional skeleton of a two-dimensional human extracted from a two-dimensional image. In some embodiments, the series of actions 9000 includes extracting three-dimensional pose data and three-dimensional shape data corresponding to a three-dimensional skeleton of the two-dimensional human extracted from the two-dimensional image using the one or more neural networks. The series of actions 9000 further includes generating, within a three-dimensional space corresponding to the two-dimensional image, a three-dimensional human model representing the two-dimensional human by refining the three-dimensional skeleton of the three-dimensional pose data and the three-dimensional shape data according to the two-dimensional skeleton of the two-dimensional pose data.
In one or more embodiments, the series of actions 9000 includes extracting two-dimensional pose data from a two-dimensional image using a first neural network of one or more neural networks. The series of actions 9000 further includes extracting three-dimensional pose data and three-dimensional shape data using a second neural network of the one or more neural networks.
In some embodiments, the series of actions 9000 includes generating a body bounding box corresponding to a body portion of a two-dimensional human. The series of actions 9000 includes extracting three-dimensional pose data corresponding to the body portion of the two-dimensional human using a neural network according to the body bounding box.
In some embodiments, the series of actions 9000 includes generating one or more hand bounding boxes corresponding to one or more hands of a two-dimensional human. The series of actions 9000 further includes extracting additional three-dimensional pose data corresponding to one or more hands of a two-dimensional human using an additional neural network according to one or more hand bounding boxes.
In one or more embodiments, the series of actions 9000 includes combining three-dimensional pose data corresponding to a body portion of a two-dimensional human and additional three-dimensional pose data corresponding to one or more hands of the two-dimensional human. The series of actions 9000 includes iteratively modifying a position of a bone in a three-dimensional skeleton based on the position of the bone in the two-dimensional skeleton.
In one or more embodiments, the series of actions 9000 includes modifying a pose of a three-dimensional human model within a three-dimensional space. The series of actions 9000 further includes generating a modified pose of the two-dimensional human within the two-dimensional image from the pose of the three-dimensional human model in three-dimensional space. The series of actions 9000 further includes generating, with the one or more neural networks, a modified two-dimensional image comprising the two-dimensional human with the modified pose according to a camera position associated with the two-dimensional image.
According to one or more embodiments, the series of actions 9000 includes extracting two-dimensional pose data from a two-dimensional human extracted from a two-dimensional image using one or more neural networks. The series of actions 9000 includes extracting three-dimensional pose data and three-dimensional shape data corresponding to the two-dimensional human extracted from the two-dimensional image using the one or more neural networks. Further, the series of actions 9000 includes generating a three-dimensional human model representing the two-dimensional human in a three-dimensional space corresponding to the two-dimensional image by combining the two-dimensional pose data with the three-dimensional pose data and the three-dimensional shape data.
In one or more embodiments, the series of actions 9000 includes extracting two-dimensional pose data, including extracting a two-dimensional skeleton from a cropped portion of the two-dimensional image using a first neural network of the one or more neural networks. The series of actions 9000 includes extracting three-dimensional pose data, including extracting a three-dimensional skeleton from the cropped portion of the two-dimensional image using a second neural network of the one or more neural networks.
In at least some embodiments, the series of actions 9000 includes extracting, with a first neural network, a first three-dimensional skeleton corresponding to a first portion of a two-dimensional human. In one or more embodiments, the series of acts 9000 includes extracting, with a second neural network, a second three-dimensional skeleton corresponding to a second portion of the two-dimensional human including the hand.
In one or more embodiments, the series of actions 9000 includes iteratively modifying a position of a bone of a second three-dimensional skeleton according to a position of the bone of the first three-dimensional skeleton within the three-dimensional space to merge the first three-dimensional skeleton and the second three-dimensional skeleton. The series of actions 9000 includes iteratively modifying a position of a bone in a first three-dimensional skeleton according to a position of a bone of a two-dimensional skeleton from two-dimensional pose data.
Turning now to FIG. 91, a flow diagram of a series of actions 9100 for modifying a two-dimensional image based on modifying a pose of a three-dimensional human model of a two-dimensional human representing the two-dimensional image is shown. While FIG. 91 illustrates acts in accordance with one embodiment, alternative embodiments may omit, add, reorder, and/or modify any of the acts illustrated in FIG. 91. The actions of FIG. 91 are part of a method. Alternatively, the non-transitory computer-readable medium includes instructions that, when executed by the one or more processors, cause the one or more processors to perform the actions of fig. 91. In yet another embodiment, the system includes a processor or server configured to perform the actions of FIG. 91.
As shown, the series of actions 9100 includes an action 9102 of generating an interaction indicator for modifying a pose of a two-dimensional human in a two-dimensional image. Further, the series of actions 9100 includes an action 9104 of modifying a pose of a three-dimensional human model representing the two-dimensional human. The series of actions 9100 further includes an action 9106 of generating a modified two-dimensional image based on the modified pose of the three-dimensional human model.
In one or more embodiments, the series of actions 9100 includes generating an interactive indicator with the two-dimensional image, the interactive indicator relating to modifying a pose of a two-dimensional human in the two-dimensional image, for display within a graphical user interface of a client device. In some embodiments, the series of actions 9100 includes modifying a pose of a three-dimensional human model representing the two-dimensional human in response to interaction with the interactive indicator. In at least some embodiments, the series of actions 9100 includes generating a modified two-dimensional image from the pose of the three-dimensional human model, the modified two-dimensional image including a modified two-dimensional human in the two-dimensional image.
According to one or more embodiments, the series of actions 9100 includes generating a three-dimensional human model in three-dimensional space from a two-dimensional human in a two-dimensional image using one or more neural networks. The series of actions 9100 includes generating an interactive indicator in response to generating a three-dimensional human model in three-dimensional space.
In one or more embodiments, the series of actions 9100 includes determining camera positions corresponding to two-dimensional images using one or more neural networks. The series of actions 9100 includes inserting a three-dimensional human model at a position within a three-dimensional space based on a camera position corresponding to the two-dimensional image. Further, the series of actions 9100 includes providing a three-dimensional human model for display with the two-dimensional image within a graphical user interface of the client device.
According to one or more embodiments, the series of actions 9100 includes extracting two-dimensional pose data from a two-dimensional human in a two-dimensional image using one or more neural networks. The series of actions 9100 includes extracting three-dimensional pose data from the two-dimensional human in the two-dimensional image using the one or more neural networks. The series of actions 9100 further includes generating a three-dimensional human model based on the two-dimensional pose data and the three-dimensional pose data at a location within the three-dimensional space.
In one or more embodiments, the series of actions 9100 includes providing a three-dimensional human model as an overlay within a graphical user interface of the client device based on a position of the two-dimensional human in the two-dimensional image.
In some embodiments, the series of actions 9100 includes determining an initial pose of the three-dimensional human model based on the pose of the two-dimensional human in the two-dimensional image. The series of actions 9100 includes modifying the pose of the three-dimensional human model based on the initial pose of the three-dimensional human model and the interaction with the interaction indicator.
In one or more embodiments, the series of actions 9100 includes determining a range of motion of one or more portions of the three-dimensional human model from an initial pose of the three-dimensional human model and a target pose of the three-dimensional human model. Further, the series of actions 9100 includes providing corresponding ranges of motion of one or more corresponding portions of the two-dimensional human being for display within a graphical user interface of the client device in connection with interaction with the interaction indicator.
In some embodiments, the series of actions 9100 includes generating a modified two-dimensional human based on the pose of the three-dimensional human model and the initial texture of the two-dimensional human. In one or more embodiments, the series of actions 9100 includes generating a modified two-dimensional image including a modified two-dimensional human.
Further, the series of actions 9100 includes determining interactions between the three-dimensional human model and a three-dimensional object corresponding to the two-dimensional object in the two-dimensional image within the three-dimensional space and based on the pose of the three-dimensional human model. The series of actions 9100 further includes generating a modified two-dimensional image representing an interaction between the modified two-dimensional human being and the two-dimensional object from the interaction between the three-dimensional human model and the three-dimensional object.
In one or more embodiments, the series of actions 9100 includes generating a three-dimensional human model representing a two-dimensional human of a two-dimensional image in three-dimensional space for display within a graphical user interface of a client device. The series of actions 9100 includes generating an interactive indicator related to modifying a pose of a three-dimensional human model representing a two-dimensional human in a two-dimensional image for display within a graphical user interface of a client device. Further, the series of actions 9100 includes modifying a pose of a three-dimensional human model representing a two-dimensional human in three-dimensional space in response to interaction with the interactive indicator. The series of actions 9100 further includes generating a modified two-dimensional image from the pose of the three-dimensional human model, the modified two-dimensional image including a modified two-dimensional human in the two-dimensional image.
In one or more embodiments, the series of actions 9100 includes generating a three-dimensional human model in three-dimensional space using a plurality of neural networks according to a position of a two-dimensional human in a scene of a two-dimensional image. The series of actions 9100 includes generating an interactive indicator comprising one or more controls for modifying one or more portions of a three-dimensional human model within a three-dimensional space.
Further, the series of actions 9100 includes modifying a pose of the three-dimensional human model by modifying a pose of a portion of the three-dimensional human model according to one or more controls in response to interaction with the interactive indicator. The series of actions 9100 includes modifying, within a graphical user interface of the client device, a pose of a portion of the two-dimensional human corresponding to the portion of the three-dimensional human model in conjunction with interaction with the interactive indicator.
According to one or more embodiments, the series of actions 9100 includes determining a motion constraint associated with the portion of the three-dimensional human model based on a pose prior corresponding to the portion of the three-dimensional human model. The series of actions 9100 includes modifying the portion of the three-dimensional human model according to motion constraints.
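As a concrete picture of a motion constraint derived from a pose prior, the sketch below clamps requested joint angles to per-joint limits; the joint names and limit values are made up for illustration and are not taken from this disclosure.

```python
import numpy as np

# Hypothetical per-joint rotation limits (radians) standing in for a pose prior.
JOINT_LIMITS = {
    "elbow": (0.0, 2.6),    # no backwards hyperextension
    "knee":  (0.0, 2.4),
    "neck":  (-0.8, 0.8),
}

def constrain_pose(requested_angles):
    """Clamp each requested joint angle to the prior's allowed range."""
    constrained = {}
    for joint, angle in requested_angles.items():
        low, high = JOINT_LIMITS.get(joint, (-np.pi, np.pi))
        constrained[joint] = float(np.clip(angle, low, high))
    return constrained

# Example: a drag that would bend the elbow past its limit gets clamped to 2.6.
print(constrain_pose({"elbow": 3.1, "knee": 1.0}))
```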
In one or more embodiments, the series of actions 9100 includes determining an initial texture corresponding to the three-dimensional human model based on the two-dimensional human of the two-dimensional image. The series of actions 9100 includes generating a modified two-dimensional image from the pose of the three-dimensional human model and the initial texture corresponding to the three-dimensional human model using a neural network.
In some embodiments, the series of actions 9100 includes determining a background region of the two-dimensional image occluded by the two-dimensional human according to an initial pose of the two-dimensional human. The series of actions 9100 further includes generating an inpainted region for the background region of the two-dimensional image for display within the graphical user interface of the client device in connection with modifying the pose of the two-dimensional human.
According to some embodiments, the series of actions 9100 includes determining a modified shape of the three-dimensional human model within the three-dimensional space in response to additional interactions with the additional interaction indicators. Further, the series of actions 9100 includes generating a modified two-dimensional image including a modified two-dimensional human in the two-dimensional image from the pose of the three-dimensional human model and the modified shape of the three-dimensional human model.
In some embodiments, the series of actions 9100 includes generating an interactive indicator with the two-dimensional image, the interactive indicator relating to modifying a pose of a two-dimensional human in the two-dimensional image, for display within a graphical user interface of the client device. The series of actions 9100 further includes modifying a pose of a three-dimensional human model representing the two-dimensional human in response to interaction with the interactive indicator. Further, the series of actions 9100 includes generating a modified two-dimensional image from the pose of the three-dimensional human model, the modified two-dimensional image including a modified two-dimensional human in the two-dimensional image.
In at least some embodiments, the series of actions 9100 includes generating a three-dimensional human model within a three-dimensional space using a plurality of neural networks from three-dimensional poses and three-dimensional shapes extracted from a two-dimensional human in a two-dimensional image. The series of actions 9100 further includes generating an interactive indicator in response to generating a three-dimensional human model in three-dimensional space.
In some embodiments, the series of actions 9100 includes modifying a pose of the three-dimensional human model, including determining a request to change an initial pose of the three-dimensional human model to a target pose of the three-dimensional human model in response to interaction with the interactive indicator. The series of actions 9100 further includes generating a modified two-dimensional image including a modified pose of the two-dimensional human based on the initial pose of the three-dimensional human model and the target pose of the three-dimensional human model.
In one or more embodiments, the series of actions 9100 includes generating a three-dimensional human model based on an initial pose of a two-dimensional human in a two-dimensional image using a plurality of neural networks. The series of actions 9100 includes providing the three-dimensional human model and the interaction indicator as overlays in the two-dimensional image at a location corresponding to the two-dimensional human in the two-dimensional image for display within a graphical user interface of the client device.
Turning now to fig. 92, a flow chart of a series of actions 9200 for transforming a planar surface of an object in a two-dimensional image based on a three-dimensional representation of the two-dimensional image is shown. Although FIG. 92 illustrates acts in accordance with one embodiment, alternative embodiments may omit, add, reorder, and/or modify any of the acts illustrated in FIG. 92. The actions of FIG. 92 are part of a method. Alternatively, the non-transitory computer-readable medium includes instructions that, when executed by the one or more processors, cause the one or more processors to perform the actions of fig. 92. In yet another embodiment, the system includes a processor or server configured to perform the actions of FIG. 92.
As shown, the series of actions 9200 includes an action 9202 of determining a three-dimensional position value relative to an object in a three-dimensional representation of a scene. Further, the series of actions 9200 includes an action 9204 of generating a planar surface corresponding to the three-dimensional position value in association with the modification object. The series of actions 9200 also includes an action 9206 of providing a portion of the planar surface for display via a graphical user interface.
In one or more embodiments, the series of actions 9200 includes determining a three-dimensional position value for a portion of an object of one or more objects relative to a three-dimensional representation of a scene on one or more axes within a three-dimensional space. In some embodiments, the series of actions 9200 includes generating a planar surface corresponding to a three-dimensional position value relative to a portion of the object in one or more axes in three-dimensional space in connection with modifying the object in three-dimensional space. According to one or more embodiments, the series of actions 9200 includes providing a portion of a planar surface for display via a graphical user interface of a client device.
In some embodiments, the series of actions 9200 includes generating a planar surface on one or more axes within a three-dimensional space in response to selection of an object via a graphical user interface.
According to one or more embodiments, the series of actions 9200 includes generating a partially transparent texture for the portion of the planar surface for display within a graphical user interface of the client device.
In one or more embodiments, the series of actions 9200 includes detecting movement of the object from a first position to a second position within the three-dimensional space. The series of actions 9200 further includes modifying the position of the planar surface from the first position to the second position in response to movement of the object.
In some embodiments, the series of actions 9200 includes determining a portion of the object from a position of the object on one or more axes within the three-dimensional space in response to selection of the object via the graphical user interface.
In one or more embodiments, the series of actions 9200 includes modifying a visual characteristic of the portion of the planar surface in response to detecting a change in the position of the object relative to the planar surface along the axis of the one or more axes.
According to one or more embodiments, the series of actions 9200 includes providing, for display within a graphical user interface, an option to snap a position of an object to a nearest surface along an axis of one or more axes within a three-dimensional space. The series of actions 9200 includes moving the object along the axis of the one or more axes within the three-dimensional space to a position adjacent to the nearest surface in response to selection of the option. The series of actions 9200 further includes modifying a position of the planar surface or a texture of the planar surface in response to moving the object to the position adjacent to the nearest surface.
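The snapping behavior can be pictured as a search over candidate surface heights along the chosen axis; representing the scene as a flat list of surface heights is a simplification assumed purely for illustration.

```python
def snap_to_nearest_surface(object_bottom, surface_heights):
    """Return the candidate surface value closest to the object's lowest point.

    object_bottom: the object's lowest value along the snapping axis (assumed)
    surface_heights: values of candidate supporting surfaces along that axis
    """
    return min(surface_heights, key=lambda h: abs(h - object_bottom))

# Example: an object floating at 1.3 along the vertical axis snaps to the
# tabletop surface at 1.1 rather than the floor at 0.0.
new_height = snap_to_nearest_surface(1.3, [0.0, 1.1, 2.5])
```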
In one or more embodiments, the series of actions 9200 includes determining a size or shape of a planar surface for display via a graphical user interface based on the object.
In some embodiments, the series of actions 9200 includes determining that a distance between the object and an additional object within the three-dimensional space is below a threshold distance. The series of actions 9200 includes generating an additional planar surface corresponding to a surface of the additional object for display via a graphical user interface of the client device in response to the distance between the object and the additional object being below a threshold distance. The series of actions 9200 further includes determining a size of the visible portion of the additional planar surface based on a distance between the object and the additional object.
In one or more embodiments, the series of actions 9200 includes determining a three-dimensional representation of a scene comprising a two-dimensional image of one or more objects within a three-dimensional space. The series of actions 9200 further includes determining a three-dimensional position value for a portion of an object relative to one or more objects in a three-dimensional representation of the scene on one or more axes within the three-dimensional space. Further, the series of actions 9200 includes generating a planar surface corresponding to three-dimensional position values relative to portions of the object in one or more axes in three-dimensional space in conjunction with modifying the position of the object in three-dimensional space. In some embodiments, the series of actions 9200 includes providing a portion of a planar surface for display via a graphical user interface of a client device.
According to one or more embodiments, the series of actions 9200 includes generating, with one or more neural networks, one or more foreground three-dimensional meshes representing one or more foreground objects in a two-dimensional image. The series of actions 9200 further includes generating a background three-dimensional mesh representing a background in the two-dimensional image using the one or more neural networks.
In one or more embodiments, the series of actions 9200 includes determining an input within the three-dimensional space that includes a selection of an object and an indication to move the object within the three-dimensional space. The series of actions 9200 includes modifying a position of the object and a position of the portion of the planar surface in three-dimensional space in response to the input.
In some embodiments, the series of actions 9200 includes determining an input within the three-dimensional space that includes a selection of a planar surface and an indication of moving the planar surface within the three-dimensional space. The series of actions 9200 includes modifying a position of the object and a position of the portion of the planar surface in three-dimensional space in response to the input.
In some embodiments, the series of actions 9200 includes determining a horizon for a scene corresponding to the two-dimensional image based on camera positions of the two-dimensional image. The series of actions 9200 further includes providing a portion of the planar surface for display according to a distance from a location of the planar surface to the horizon.
In one or more embodiments, the series of actions 9200 includes providing a portion of a planar surface having a first texture for display. The series of actions 9200 further includes providing that the portion of the planar surface is displayed with a second texture different from the first texture in response to an input modifying the position of the object in three-dimensional space.
The series of actions 9200 includes generating an object platform at the portion of the planar surface, the object platform indicating a position of the object relative to one or more planar axes corresponding to the planar surface, the object platform including a texture different from one or more additional portions of the planar surface. The series of actions 9200 further includes modifying a position of the object platform along the planar surface in response to modifying the position of the object along one or more planar axes corresponding to the planar surface.
In at least some embodiments, the series of actions 9200 includes determining a three-dimensional position value for a portion of an object of one or more objects relative to a three-dimensional representation of a scene on one or more axes within a three-dimensional space. The series of actions 9200 further includes generating a planar surface corresponding to the three-dimensional position value relative to the portion of the object in one or more axes in three-dimensional space in conjunction with modifying the position of the object in three-dimensional space. The series of actions 9200 further includes modifying a portion of a planar surface within a graphical user interface of the client device in response to modifying the position of the object within the three-dimensional space.
In one or more embodiments, determining the three-dimensional position value for the portion of the object includes determining a lowest three-dimensional position value for the object along a vertical axis within the three-dimensional space. Generating the planar surface further includes generating the planar surface along a horizontal axis perpendicular to the vertical axis at the lowest three-dimensional position value of the object along the vertical axis.
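A minimal sketch of this placement rule follows: take the minimum of the object's vertices along the vertical axis and lay out an axis-aligned quad there. Treating y as the vertical axis and the quad's half-size are arbitrary illustrative choices.

```python
import numpy as np

def ground_plane_for_object(vertices, half_size=1.0):
    """Build a horizontal quad at the object's lowest vertical position value.

    vertices: (N, 3) object mesh vertices, with y assumed to be vertical
    half_size: half the side length of the generated plane (arbitrary)
    """
    y_min = vertices[:, 1].min()                       # lowest 3D position value
    cx, cz = vertices[:, 0].mean(), vertices[:, 2].mean()
    return np.array([
        [cx - half_size, y_min, cz - half_size],
        [cx + half_size, y_min, cz - half_size],
        [cx + half_size, y_min, cz + half_size],
        [cx - half_size, y_min, cz + half_size],
    ])                                                 # quad corners in the horizontal plane
```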
In some embodiments, the series of actions 9200 includes detecting a change in the position of the object along an axis of the one or more axes. The series of actions 9200 further includes modifying a position of the portion of the planar surface according to a change in the position of the object along the axis of the one or more axes.
Turning now to FIG. 93, a flow diagram of a series of actions 9300 for modifying a focus of a two-dimensional image based on a three-dimensional representation of the two-dimensional image is shown in accordance with one or more embodiments. Although FIG. 93 illustrates acts in accordance with one embodiment, alternative embodiments may omit, add, reorder, and/or modify any of the acts illustrated in FIG. 93. The acts of fig. 93 are part of a method. Alternatively, the non-transitory computer-readable medium includes instructions that, when executed by the one or more processors, cause the one or more processors to perform the actions of fig. 93. In yet another embodiment, the system includes a processor or server configured to perform the actions of FIG. 93.
As shown, the series of actions 9300 includes an action 9302 of generating a three-dimensional representation of a two-dimensional image. Further, the series of actions 9300 includes an action 9304 of determining a focus of the two-dimensional image based on a position of the input element within the three-dimensional representation. The series of actions 9300 further includes an action 9306 of generating a modified two-dimensional image comprising image blur based on the focus.
In one or more embodiments, the series of actions 9300 includes generating, by at least one processor, a three-dimensional representation of a two-dimensional image including one or more objects. According to one or more embodiments, the series of actions 9300 includes determining, by the at least one processor, a focus for the two-dimensional image from a camera position of the two-dimensional image based on a three-dimensional position of the input element within the three-dimensional representation of the two-dimensional image. In some embodiments, the series of actions 9300 includes generating, by the at least one processor, a modified two-dimensional image including image blur based on a focus corresponding to the three-dimensional position of the input element.
In at least some embodiments, the series of actions 9300 includes generating, with one or more neural networks, one or more foreground three-dimensional meshes corresponding to one or more foreground objects in the two-dimensional image. The series of actions 9300 includes generating a background three-dimensional mesh corresponding to one or more background objects in a two-dimensional image using one or more neural networks.
In some embodiments, the series of actions 9300 includes generating an input element in response to an input via a graphical user interface displaying a two-dimensional image, the input element including a three-dimensional object within a three-dimensional space including a three-dimensional representation. The series of actions 9300 further includes determining a focus based on three-dimensional coordinates of the three-dimensional object in three-dimensional space.
In one or more embodiments, the series of actions 9300 includes receiving an input for modifying three-dimensional coordinates of a three-dimensional object within a three-dimensional space. The series of actions 9300 includes determining modified three-dimensional coordinates and modified dimensions of the three-dimensional object within the three-dimensional space in response to the input. Further, the series of actions 9300 includes updating the focus based on the modified three-dimensional coordinates of the three-dimensional object.
According to one or more embodiments, the series of actions 9300 includes determining two-dimensional coordinates within an image space corresponding to an input via a graphical user interface. For example, the series of actions 9300 includes determining the three-dimensional position by converting the two-dimensional coordinates in the image space to three-dimensional coordinates within a three-dimensional space including the three-dimensional representation based on a depth map corresponding to the two-dimensional image.
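The image-space-to-3D conversion can be sketched as a depth-map lookup followed by a pinhole unprojection; the intrinsics (`fx`, `fy`, `cx`, `cy`) and the metric depth-map convention are assumptions for illustration rather than details taken from this disclosure.

```python
import numpy as np

def unproject(u, v, depth_map, fx, fy, cx, cy):
    """Convert 2D pixel coordinates to 3D coordinates using a depth map.

    u, v: pixel coordinates of the input via the graphical user interface
    depth_map: per-pixel depth of the two-dimensional image (assumed metric)
    fx, fy, cx, cy: pinhole intrinsics of the assumed scene camera
    """
    z = depth_map[int(v), int(u)]
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return np.array([x, y, z])     # 3D coordinates of the input element
```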
In some embodiments, the series of actions 9300 includes determining depth values for identified pixels of the two-dimensional image from the three-dimensional position of the input element based on the depth map of the two-dimensional image. For example, the series of actions 9300 includes blurring pixels in one or more portions of the two-dimensional image with a blur filter based on differences between depth values of the identified pixels and depth values of pixels in the one or more portions of the two-dimensional image.
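A depth-of-field sketch along these lines is shown below: blur strength grows with the difference between each pixel's depth and the depth at the identified focus pixel. The number of blur levels, the maximum sigma, and the use of SciPy's Gaussian filter are illustrative choices, not the disclosed rendering path.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def depth_blur(image, depth_map, focus_uv, max_sigma=6.0, n_levels=4):
    """Blur pixels in proportion to |depth - focus depth|.

    image: (H, W, 3) float array; depth_map: (H, W) per-pixel depths (assumed)
    focus_uv: (u, v) pixel identified from the input element's 3D position
    """
    focus_depth = depth_map[focus_uv[1], focus_uv[0]]
    diff = np.abs(depth_map - focus_depth)
    sigma_map = max_sigma * diff / (diff.max() + 1e-8)

    # Precompute a few blur levels, then pick per pixel by nearest sigma.
    levels = np.linspace(0.0, max_sigma, n_levels)
    blurred = [gaussian_filter(image, sigma=(s, s, 0)) for s in levels]
    nearest = np.abs(sigma_map[..., None] - levels).argmin(axis=-1)

    out = np.empty_like(image)
    for i, level_img in enumerate(blurred):
        out[nearest == i] = level_img[nearest == i]
    return out
```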
In one or more embodiments, the series of actions 9300 includes determining a three-dimensional depth based on a three-dimensional position of the input element and a position of the virtual camera within a three-dimensional space including the three-dimensional representation. The series of actions 9300 also includes modifying camera parameters of the virtual camera according to the three-dimensional depth.
According to one or more embodiments, the series of actions 9300 includes determining a portion of a two-dimensional image corresponding to a three-dimensional position of an input element. Further, the series of actions 9300 includes generating a modified two-dimensional image that is zoomed in on the portion of the two-dimensional image by modifying a camera position of the camera within a three-dimensional space that includes the three-dimensional representation according to the portion of the two-dimensional image.
In one or more embodiments, the series of actions 9300 includes determining a range of movement of the input element from a first three-dimensional position to a second three-dimensional position within a three-dimensional space comprising the three-dimensional representation. Further, the series of actions 9300 includes generating an animation that blurs different portions of the two-dimensional image for display within the graphical user interface based on the range of movement of the input element from the first three-dimensional position to the second three-dimensional position.
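The animation described above can be sketched as a sweep of the focus depth between the element's start and end positions, re-rendering the blur per frame. The `blur_at_depth` callable stands in for any depth-dependent blur (for example, a variant of the sketch above keyed by depth rather than by pixel), and the frame count is arbitrary.

```python
import numpy as np

def rack_focus_frames(image, depth_map, start_depth, end_depth,
                      blur_at_depth, n_frames=24):
    """Yield frames whose focus depth sweeps from start_depth to end_depth.

    blur_at_depth: assumed callable(image, depth_map, focus_depth) -> frame.
    """
    for t in np.linspace(0.0, 1.0, n_frames):
        focus_depth = (1.0 - t) * start_depth + t * end_depth
        yield blur_at_depth(image, depth_map, focus_depth)
```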
In one or more embodiments, the series of actions 9300 includes generating a three-dimensional representation of a two-dimensional image including one or more objects. The series of actions 9300 includes determining a three-dimensional position within a three-dimensional space including the three-dimensional representation of the two-dimensional image from an input element within a graphical user interface. Further, the series of actions 9300 includes determining a focus of the two-dimensional image based on the three-dimensional position within the three-dimensional space by determining a depth associated with the three-dimensional position. The series of actions 9300 further includes generating a modified two-dimensional image by modifying an image blur of one or more portions of the two-dimensional image based on the focus.
In some embodiments, the series of actions 9300 includes generating one or more three-dimensional meshes corresponding to one or more foreground objects or one or more background objects of the two-dimensional image using one or more neural networks.
In one or more embodiments, the series of actions 9300 includes determining a position of an input element within an image space of a two-dimensional image. Further, the series of actions 9300 includes determining a three-dimensional position within a three-dimensional space including a three-dimensional representation based on a mapping between the image space and the three-dimensional space.
For example, the series of actions 9300 includes determining a modified position of the input element within the image space of the two-dimensional image from an input via a graphical user interface displaying the two-dimensional image. The series of actions 9300 further includes modifying a size of the input element and the three-dimensional position within the three-dimensional space in response to the modified position of the input element.
According to one or more embodiments, the series of actions 9300 includes determining a depth associated with a three-dimensional location by determining a distance between the three-dimensional location and a camera location corresponding to a camera within the three-dimensional space. The series of actions 9300 further includes generating a modified two-dimensional image by modifying camera parameters corresponding to the camera within the three-dimensional space based on the distance between the three-dimensional location and the camera location.
In one or more embodiments, the series of actions 9300 includes determining a focus of the two-dimensional image by determining pixels corresponding to three-dimensional locations within the three-dimensional space, and determining a depth associated with the three-dimensional locations based on depth values of pixels corresponding to the three-dimensional locations from a depth map of the two-dimensional image. Further, the series of actions 9300 include generating a modified two-dimensional image by applying a blur filter to additional pixels in the two-dimensional image based on differences in depth values of the additional pixels relative to the depth values of the pixels.
In some embodiments, the series of actions 9300 includes determining movement of the input element from a three-dimensional position within the three-dimensional space to an additional three-dimensional position within the three-dimensional space. Further, the series of actions 9300 includes modifying blur values of pixels in the two-dimensional image within the graphical user interface as the input element moves from the three-dimensional position to the additional three-dimensional position according to a first three-dimensional depth of the three-dimensional position and a second three-dimensional depth of the additional three-dimensional position.
In at least some embodiments, the series of actions 9300 includes generating a three-dimensional representation of a two-dimensional image including one or more objects. Further, the series of actions 9300 includes determining a focus of the two-dimensional image based on a three-dimensional position of the input element within the three-dimensional representation of the two-dimensional image from a camera position of the two-dimensional image. In some embodiments, the series of actions 9300 includes generating a modified two-dimensional image including a partial image modification based on a focus corresponding to a three-dimensional position of the input element. For example, generating the modified two-dimensional image includes applying image blur to the content of the two-dimensional image according to the three-dimensional position of the input element.
In one or more embodiments, the series of actions 9300 includes generating the three-dimensional representation by generating one or more three-dimensional meshes corresponding to one or more objects in the two-dimensional image. In the series of actions 9300, determining the focus can further include determining that the three-dimensional position of the input element corresponds to a three-dimensional depth of a three-dimensional mesh of the one or more three-dimensional meshes.
According to one or more embodiments, the series of actions 9300 includes determining camera parameters of a camera within a three-dimensional space including a three-dimensional representation based on three-dimensional depths of a three-dimensional mesh of the one or more three-dimensional meshes. The series of actions 9300 also includes generating, with the three-dimensional renderer, a modified two-dimensional image according to the camera parameters.
In at least some embodiments, the series of actions 9300 includes generating an input element that includes a three-dimensional object within a three-dimensional space that includes a three-dimensional representation of a two-dimensional image. The series of actions 9300 further includes determining a focus based on three-dimensional coordinates of the three-dimensional object in three-dimensional space.
Embodiments of the present disclosure may include or utilize a special purpose or general-purpose computer including computer hardware, such as one or more processors and system memory, as discussed in more detail below. Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. In particular, one or more processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein). In general, a processor (e.g., a microprocessor) receives instructions from a non-transitory computer-readable medium (e.g., memory) and executes the instructions, thereby executing one or more processes, including one or more processes described herein.
Computer readable media can be any available media that can be accessed by a general purpose or special purpose computer system. The computer-readable medium storing computer-executable instructions is a non-transitory computer-readable storage medium (device). The computer-readable medium carrying computer-executable instructions is a transmission medium. Thus, by way of example, and not limitation, embodiments of the present disclosure may include at least two distinct computer-readable media: a non-transitory computer readable storage medium (device) and a transmission medium.
Non-transitory computer readable storage media (devices) include RAM, ROM, EEPROM, CD-ROM, solid state drives (SSDs) (e.g., based on RAM), flash memory, phase change memory (PCM), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired program code means in the form of computer-executable instructions or data structures and that can be accessed by a general purpose or special purpose computer.
A network is defined as one or more data links capable of transmitting electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. The transmission media can include networks and/or data links, which can be used to carry desired program code means in the form of computer-executable instructions or data structures, and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.
Furthermore, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be automatically transferred from transmission media to non-transitory computer-readable storage media (devices) (or vice versa). For example, computer-executable instructions or data structures received over a network or data link may be cached in RAM within a network interface module (e.g., NIC), and then ultimately transferred to computer system RAM and/or less volatile computer storage media (devices) at a computer system. Thus, it should be understood that a non-transitory computer readable storage medium (device) can be included in a computer system component that also (or even primarily) utilizes transmission media.
Computer-executable instructions comprise, for example, instructions and data which, when executed by a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. In some embodiments, computer-executable instructions are executed on a general-purpose computer to transform the general-purpose computer into a special-purpose computer that implements the elements of the present disclosure. The computer-executable instructions may be, for example, binary, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.
Those skilled in the art will appreciate that the invention may be practiced in network computing environments with many types of computer system configurations, including personal computers, desktop computers, laptop computers, message processors, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablet computers, pagers, routers, switches, and the like. The invention may also be practiced in distributed system environments where local and remote computer systems perform tasks over a network link (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links). In a distributed system environment, program modules may be located in both local and remote memory storage devices.
Embodiments of the present disclosure may also be implemented in a cloud computing environment. In this specification, "cloud computing" is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing may be used in the marketplace to provide ubiquitous and convenient on-demand access to a shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly.
The cloud computing model may be composed of various features such as on-demand self-service, wide network access, resource pools, rapid elasticity, measurement services, and the like. The cloud computing model may also disclose various service models, such as software as a service ("SaaS"), platform as a service ("PaaS"), and infrastructure as a service ("IaaS"). The cloud computing model may also be deployed using different deployment models, such as private cloud, community cloud, public cloud, hybrid cloud, and so forth. In this specification and in the claims, a "cloud computing environment" is an environment that uses cloud computing.
Fig. 94 illustrates a block diagram of an exemplary computing device 9400, which can be configured to perform one or more of the processes described above. It is to be appreciated that one or more computing devices, such as computing device 9400, can represent the computing devices described above (e.g., server(s) 102 and/or client devices 110a-110n). In one or more embodiments, the computing device 9400 can be a mobile device (e.g., a mobile phone, a smartphone, a PDA, a tablet, a laptop, a camera, a tracker, a watch, a wearable device). In some embodiments, computing device 9400 can be a non-mobile device (e.g., a desktop computer or another type of client device). Further, computing device 9400 can be a server device that includes cloud-based processing and storage capabilities.
As shown in fig. 94, computing device 9400 can include one or more processors 9402, memory 9404, storage devices 9406, input/output interfaces 9408 (or "I/O interfaces 9408"), and communication interfaces 9410, which can be communicatively coupled via a communication infrastructure (e.g., bus 9412). Although computing device 9400 is shown in fig. 94, the components shown in fig. 94 are not intended to be limiting. Additional or alternative components may be used in other embodiments. Moreover, in certain embodiments, computing device 9400 includes fewer components than shown in FIG. 94. The components of computing device 9400 shown in fig. 94 will now be described in more detail.
In a particular embodiment, the processor(s) 9402 include hardware for executing instructions, such as those instructions that make up a computer program. By way of example, and not limitation, to execute instructions, processor(s) 9402 may retrieve (or fetch) instructions from internal registers, internal caches, memory 9404, or storage device 9406, and decode and execute them.
The computing device 9400 includes memory 9404 coupled to processor(s) 9402. The memory 9404 may be used for storing data, metadata, and programs for execution by the processor(s). Memory 9404 can include one or more of volatile and nonvolatile memory, such as random access memory ("RAM"), read only memory ("ROM"), solid state disk ("SSD"), flash memory, phase change memory ("PCM"), or other types of data storage. Memory 9404 may be internal or distributed memory.
Computing device 9400 includes a storage device 9406, which includes memory for storing data or instructions. By way of example, and not limitation, storage device 9406 may comprise non-transitory storage media as described above. Storage device 9406 may include a Hard Disk Drive (HDD), flash memory, a Universal Serial Bus (USB) drive, or a combination of these or other storage devices.
As shown, computing device 9400 includes one or more I/O interfaces 9408 that are provided to allow a user to provide input (e.g., user strokes) to computing device 9400, to receive output from computing device 9400, and to otherwise transmit data to computing device 9400 and from computing device 9400. These I/O interfaces 9408 can include a mouse, a keypad or keyboard, a touch screen, a camera, an optical scanner, a network interface, a modem, other known I/O devices, or a combination of these I/O interfaces 9408. The touch screen may be activated with a stylus or finger.
The I/O interface 9408 can include one or more devices for presenting output to a user, including but not limited to a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., a display driver), one or more audio speakers, and one or more audio drivers. In some embodiments, the I/O interface 9408 is configured to provide graphical data to a display for presentation to a user. The graphical data may represent one or more graphical user interfaces and/or any other graphical content that may serve a particular implementation.
Computing device 9400 can also include a communication interface 9410. Communication interface 9410 may include hardware, software, or both. Communication interface 9410 provides one or more interfaces for communication (e.g., packet-based communication) between the computing device and one or more other computing devices or one or more networks. By way of example, and not limitation, communication interface 9410 may include a Network Interface Controller (NIC) or network adapter to communicate with an ethernet or other wire-based network, or a Wireless NIC (WNIC) or wireless adapter to communicate with a wireless network, such as Wi-Fi. The computing device 9400 can also include a bus 9412. The bus 9412 may include hardware, software, or both that connects the components of the computing device 9400 to one another.
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. Various embodiments and aspects of the invention are described with reference to details discussed herein and the accompanying drawings illustrate the various embodiments. The foregoing description and drawings are illustrative of the invention and are not to be construed as limiting the invention. Numerous specific details are described to provide a thorough understanding of various embodiments of the invention.
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. For example, the methods described herein may be performed with fewer or more steps/acts, or the steps/acts may be performed in a different order. Furthermore, the steps/acts described herein may be repeated or performed in parallel with each other, or with different instances of the same or similar steps/acts. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims (20)

1. A computer-implemented method, comprising:
generating, by at least one processor, an interactive indicator with a two-dimensional image for display within a graphical user interface of a client device, the interactive indicator relating to modifying a pose of a two-dimensional human in the two-dimensional image;
modifying, by the at least one processor, a pose of a three-dimensional human model representing the two-dimensional human in response to interaction with the interaction indicator; and
generating, by the at least one processor, a modified two-dimensional image from the pose of the three-dimensional human model, the modified two-dimensional image comprising a modified two-dimensional human in the two-dimensional image.
2. The computer-implemented method of claim 1, wherein generating the interaction indicator comprises:
generating the three-dimensional human model in three-dimensional space using one or more neural networks from the two-dimensional human in the two-dimensional image; and
generating the interaction indicator in response to generating the three-dimensional human model within the three-dimensional space.
3. The computer-implemented method of claim 2, wherein generating the three-dimensional human model comprises:
determining, with the one or more neural networks, a camera position corresponding to the two-dimensional image;
inserting the three-dimensional human model into a position within the three-dimensional space based on the camera position corresponding to the two-dimensional image; and
providing the three-dimensional human model for display with the two-dimensional image within the graphical user interface of the client device.
4. The computer-implemented method of claim 3, wherein generating the three-dimensional human model comprises:
extracting two-dimensional pose data from the two-dimensional human in the two-dimensional image using the one or more neural networks;
extracting three-dimensional pose data from the two-dimensional human in the two-dimensional image using one or more neural networks; and
generating the three-dimensional human model at the location within the three-dimensional space based on the two-dimensional pose data and the three-dimensional pose data.
5. The computer-implemented method of claim 2, wherein generating the three-dimensional human model includes providing the three-dimensional human model as an overlay within the graphical user interface of the client device based on a position of the two-dimensional human in the two-dimensional image.
6. The computer-implemented method of claim 1, wherein modifying the pose of the three-dimensional human model comprises:
determining an initial pose of the three-dimensional human model based on the pose of the two-dimensional human in the two-dimensional image; and
modifying the pose of the three-dimensional human model based on the initial pose of the three-dimensional human model and the interaction with the interaction indicator.
7. The computer-implemented method of claim 6, wherein generating the modified two-dimensional image comprises:
determining a range of motion of one or more portions of the three-dimensional human model from the initial pose of the three-dimensional human model and a target pose of the three-dimensional human model; and
providing a corresponding range of motion of one or more corresponding portions of the two-dimensional human for display within the graphical user interface of the client device in connection with the interaction with the interaction indicator.
8. The computer-implemented method of claim 1, wherein generating the modified two-dimensional image comprises:
generating the modified two-dimensional human based on the pose of the three-dimensional human model and an initial texture of the two-dimensional human; and
generating the modified two-dimensional image that includes the modified two-dimensional human.
9. The computer-implemented method of claim 1, wherein generating the modified two-dimensional image comprises:
determining, within a three-dimensional space and based on the pose of the three-dimensional human model, interactions between the three-dimensional human model and a three-dimensional object corresponding to a two-dimensional object in the two-dimensional image; and
generating, from the interaction between the three-dimensional human model and the three-dimensional object, the modified two-dimensional image representing the interaction between the modified two-dimensional human and the two-dimensional object.
10. A system, comprising:
one or more memory devices comprising a two-dimensional image; and
one or more processors configured to cause the system to:
generate a three-dimensional human model representing a two-dimensional human of the two-dimensional image within a three-dimensional space for display within a graphical user interface of a client device;
generate an interaction indicator for display within the graphical user interface of the client device, the interaction indicator relating to modifying a pose of the three-dimensional human model representing the two-dimensional human in the two-dimensional image;
modify the pose of the three-dimensional human model representing the two-dimensional human within the three-dimensional space in response to an interaction with the interaction indicator; and
generate a modified two-dimensional image that includes a modified two-dimensional human of the two-dimensional image according to the pose of the three-dimensional human model.
11. The system of claim 10, wherein the one or more processors are configured to cause the system to:
generate the three-dimensional human model within the three-dimensional space according to a position of the two-dimensional human within a scene of the two-dimensional image using a plurality of neural networks; and
generate the interaction indicator including one or more controls for modifying one or more portions of the three-dimensional human model within the three-dimensional space.
12. The system of claim 11, wherein the one or more processors are configured to cause the system to:
modify the pose of the three-dimensional human model by modifying a pose of a portion of the three-dimensional human model according to the one or more controls in response to the interaction with the interaction indicator; and
modify, within the graphical user interface of the client device, a pose of a portion of the two-dimensional human corresponding to the portion of the three-dimensional human model in conjunction with the interaction indicator.
13. The system of claim 10, wherein the one or more processors are configured to cause the system to modify the pose of the three-dimensional human model by:
determining a motion constraint associated with a portion of the three-dimensional human model based on a pose prior corresponding to the portion of the three-dimensional human model; and
modifying the portion of the three-dimensional human model according to the motion constraint.
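
A minimal sketch of the motion-constraint step of claim 13, under the assumption that the pose prior reduces to per-joint rotation limits; a learned prior (for example, a Gaussian mixture or a pose autoencoder) could fill the same role but is not shown.

    import numpy as np

    def apply_motion_constraints(pose, limits):
        # pose: (J, 3) per-joint rotations; limits: {joint_index: (lower, upper)}
        # with (3,) bounds derived from the pose prior for that portion of the model.
        constrained = pose.copy()
        for joint, (lower, upper) in limits.items():
            constrained[joint] = np.clip(pose[joint], lower, upper)
        return constrained
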
14. The system of claim 10, wherein the one or more processors are configured to cause the system to generate the modified two-dimensional image by:
determining an initial texture corresponding to the three-dimensional human model based on the two-dimensional human of the two-dimensional image; and
generating, using a neural network, the modified two-dimensional image according to the pose of the three-dimensional human model and the initial texture corresponding to the three-dimensional human model.
15. The system of claim 10, wherein the one or more processors are configured to cause the system to generate the modified two-dimensional image by:
determining, from an initial pose of the two-dimensional human, a background region of the two-dimensional image obscured by the two-dimensional human; and
generating, for display within the graphical user interface of the client device, an inpainted fill for the background region of the two-dimensional image in connection with modifying the pose of the two-dimensional human.
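
An illustrative sketch of claim 15, assuming segmentation masks of the human before and after reposing are available: pixels covered by the human in its initial pose but exposed after reposing form the background region that needs new content, and inpaint is a hypothetical stand-in for an inpainting network.

    import numpy as np

    def disoccluded_region(mask_initial_pose, mask_new_pose):
        # Boolean (H, W) masks: True where the human covers a pixel.
        # Pixels covered before reposing but exposed afterwards need new content.
        return mask_initial_pose & ~mask_new_pose

    def fill_background(image, mask_initial_pose, mask_new_pose, inpaint):
        hole = disoccluded_region(mask_initial_pose, mask_new_pose)
        return inpaint(image, hole)  # inpainted fill for the exposed background
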
16. The system of claim 10, wherein the one or more processors are configured to cause the system to generate the modified two-dimensional image by:
determining a modified shape of the three-dimensional human model within the three-dimensional space in response to additional interactions with additional interaction indicators; and
generating the modified two-dimensional image comprising the modified two-dimensional human of the two-dimensional image according to the pose of the three-dimensional human model and the modified shape of the three-dimensional human model.
17. A non-transitory computer-readable medium storing executable instructions that, when executed by a processing device, cause the processing device to perform operations comprising:
generating an interaction indicator for display within a graphical user interface of a client device having a two-dimensional image, the interaction indicator relating to modifying a pose of a two-dimensional human in the two-dimensional image;
modifying a pose of a three-dimensional human model representing the two-dimensional human in response to an interaction with the interaction indicator; and
generating a modified two-dimensional image that includes a modified two-dimensional human of the two-dimensional image according to the pose of the three-dimensional human model.
18. The non-transitory computer-readable medium of claim 17, wherein generating the interaction indicator comprises:
generating the three-dimensional human model within a three-dimensional space using a plurality of neural networks according to a three-dimensional pose and a three-dimensional shape extracted from the two-dimensional human in the two-dimensional image; and
generating the interaction indicator in response to generating the three-dimensional human model within the three-dimensional space.
19. The non-transitory computer readable medium of claim 17, wherein:
modifying the pose of the three-dimensional human model includes determining, in response to the interaction with the interaction indicator, a request to change an initial pose of the three-dimensional human model to a target pose of the three-dimensional human model; and
generating the modified two-dimensional image includes modifying a pose of the two-dimensional human based on the initial pose of the three-dimensional human model and the target pose of the three-dimensional human model.
20. The non-transitory computer-readable medium of claim 17, wherein the operations further comprise:
generating the three-dimensional human model based on an initial pose of the two-dimensional human in the two-dimensional image using a plurality of neural networks; and
providing, for display within the graphical user interface of the client device, the three-dimensional human model and the interaction indicator as overlays on the two-dimensional image at locations corresponding to the two-dimensional human in the two-dimensional image.
CN202311286078.XA 2022-10-06 2023-10-07 Modifying the pose of a two-dimensional human in a two-dimensional image Pending CN117853610A (en)

Applications Claiming Priority (15)

Application Number Priority Date Filing Date Title
US63/378,616 2022-10-06
US18/058,630 2022-11-23
US18/058,622 2022-11-23
US18/058,554 2022-11-23
US18/058,538 2022-11-23
US18/058,575 2022-11-23
US18/058,601 2022-11-23
US18/190,556 2023-03-27
US18/190,513 2023-03-27
US18/190,500 2023-03-27
US18/190,544 2023-03-27
US18/190,636 2023-03-27
US18/190,654 2023-03-27
US18/304,147 US20240144623A1 (en) 2022-10-06 2023-04-20 Modifying poses of two-dimensional humans in two-dimensional images by reposing three-dimensional human models representing the two-dimensional humans
US18/304,147 2023-04-20

Publications (1)

Publication Number Publication Date
CN117853610A (en) 2024-04-09

Family

ID=90529344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311286078.XA Pending CN117853610A (en) 2022-10-06 2023-10-07 Modifying the pose of a two-dimensional human in a two-dimensional image

Country Status (1)

Country Link
CN (1) CN117853610A (en)

Similar Documents

Publication Publication Date Title
Parente et al. Integration of convolutional and adversarial networks into building design: A review
CN116342377A (en) Self-adaptive generation method and system for camouflage target image in degraded scene
US20240144623A1 (en) Modifying poses of two-dimensional humans in two-dimensional images by reposing three-dimensional human models representing the two-dimensional humans
US20240135612A1 (en) Generating shadows for placed objects in depth estimated scenes of two-dimensional images
US20240127509A1 (en) Generating scale fields indicating pixel-to-metric distances relationships in digital images via neural networks
US20240144520A1 (en) Generating three-dimensional human models representing two-dimensional humans in two-dimensional images
US20240144586A1 (en) Generating shadows for objects in two-dimensional images utilizing a plurality of shadow maps
CN117853610A (en) Modifying the pose of a two-dimensional human in a two-dimensional image
CN117853681A (en) Generating a three-dimensional human model representing a two-dimensional human in a two-dimensional image
CN117830473A (en) Generating shadows for placed objects in a depth estimation scene of a two-dimensional image
AU2023210625A1 (en) Shadowmapping of depth estimated scenes
US20240135613A1 (en) Modifying digital images via perspective-aware object move
US20240135561A1 (en) Modifying digital images via depth-aware object move
US20240169628A1 (en) Moving objects casting a shadow and generating proxy shadows within a digital image
CN117853611A (en) Modifying digital images via depth aware object movement
CN117853613A (en) Modifying digital images via depth aware object movement
GB2623620A (en) Generating shadows for placed objects in depth estimated scenes of two-dimensional images
US20240169624A1 (en) Modifying digital images via scene-based editing using image understanding facilitated by artificial intelligence
US20240169501A1 (en) Dilating object masks to reduce artifacts during inpainting
US20240171848A1 (en) Removing distracting objects from digital images
US20240168617A1 (en) Detecting and modifying object attributes
US20240169685A1 (en) Detecting shadows and corresponding objects in digital images
US20240135514A1 (en) Modifying digital images via multi-layered scene completion facilitated by artificial intelligence
CN117853612A (en) Generating a modified digital image using a human repair model
US20240153047A1 (en) Modifying digital images utilizing intent deterministic user interface tools

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination