CN117853611A - Modifying digital images via depth aware object movement

Modifying digital images via depth aware object movement

Info

Publication number
CN117853611A
Authority
CN
China
Legal status
Pending
Application number
CN202311286086.4A
Other languages
Chinese (zh)
Inventor
丁志宏
S·科恩
M·乔斯
张健明
D·普拉萨德
C·戈梅斯
J·布兰特
Current Assignee
Adobe Inc
Original Assignee
Adobe Systems Inc
Application filed by Adobe Systems Inc
Publication of CN117853611A

Abstract

Embodiments of the present disclosure relate to modifying a digital image via depth aware object movement. The present disclosure relates to systems, methods, and non-transitory computer-readable media implementing perspective aware object movement operations for digital image editing. For example, in some embodiments, the disclosed system determines vanishing points associated with a digital image depicting an object. In addition, the disclosed system detects one or more user interactions for moving an object within a digital image. Based on moving the object relative to the vanishing point, the disclosed system performs perspective-based resizing of the object within the digital image.

Description

Modifying digital images via depth aware object movement
Cross Reference to Related Applications
This application claims the priority and benefit of U.S. provisional patent application No. 63/378,616, filed on October 6, 2022, which is hereby incorporated by reference in its entirety. The present application claims priority from U.S. patent application Ser. No. 18/320,664, filed on May 19, 2023, which is a continuation-in-part of U.S. patent application Ser. No. 18/190,500, filed on March 27, 2023, which is a continuation-in-part of the following patent applications: U.S. patent application Ser. No. 18/058,538 filed on November 23, 2022; U.S. patent application Ser. No. 18/058,554 filed on November 23, 2022; U.S. patent application Ser. No. 18/058,575 filed on November 23, 2022; U.S. patent application Ser. No. 18/058,601 filed on November 23, 2022; U.S. patent application Ser. No. 18/058,622 filed on November 23, 2022; and U.S. patent application Ser. No. 18/058,630 filed on November 23, 2022. Each of the above applications is incorporated by reference in its entirety.
Background
In recent years, there has been significant advancement in hardware and software platforms for performing computer vision and image editing tasks. Indeed, such systems provide a variety of image-related functions, such as object identification, classification, segmentation, composition, style transfer, image restoration, and the like.
Disclosure of Invention
One or more embodiments described herein provide benefits and/or solve one or more problems in the art with systems, methods, and non-transitory computer-readable media that implement artificial intelligence models to facilitate flexible and efficient scene-based image editing. To illustrate, in one or more embodiments, the system utilizes one or more machine learning models to learn/identify characteristics of digital images, predict potential edits to digital images, and/or generate supplemental components that may be used in various edits. Thus, the system obtains an understanding of a two-dimensional digital image as if it were a real scene, with different semantic regions reflecting real world (e.g., three-dimensional) conditions. Furthermore, the system enables two-dimensional digital images to be edited such that changes automatically and consistently reflect corresponding real world conditions without relying on additional user input.
Additional features and advantages of one or more embodiments of the disclosure are summarized in the description that follows, and in part will be apparent from the description, or may be learned by practice of such example embodiments.
Drawings
The present disclosure will describe one or more embodiments of the invention with additional specificity and detail through use of the accompanying drawings. The following paragraphs briefly introduce the drawings, in which:
FIG. 1 illustrates an example environment in which a scene-based image editing system operates in accordance with one or more embodiments;
FIG. 2 illustrates an overview of a scene-based image editing system editing a digital image as a real scene in accordance with one or more embodiments;
FIG. 3 illustrates a segmentation neural network used by a scene-based image editing system to generate object masks for objects in accordance with one or more embodiments;
FIG. 4 illustrates the use of a cascaded modulation repair neural network to generate a repaired digital image in accordance with one or more embodiments;
FIG. 5 illustrates an example architecture of a cascaded modulation repair neural network in accordance with one or more embodiments;
FIG. 6 illustrates a global modulation block and a spatial modulation block implemented in a cascaded modulation repair neural network in accordance with one or more embodiments;
FIG. 7 shows a diagram for generating object masks and content fills to facilitate object-aware modifications of a digital image in accordance with one or more embodiments;
FIGS. 8A-8D illustrate a graphical user interface implemented by a scene-based image editing system to facilitate move operations in accordance with one or more embodiments;
FIGS. 9A-9C illustrate a graphical user interface implemented by a scene-based image editing system to facilitate delete operations in accordance with one or more embodiments;
FIG. 10 illustrates an image analysis graph used by a scene-based image editing system in generating a semantic scene graph in accordance with one or more embodiments;
FIG. 11 illustrates a real-world category description graph used by a scene-based image editing system in generating a semantic scene graph in accordance with one or more embodiments;
FIG. 12 illustrates a behavior policy graph used by a scene-based image editing system in generating a semantic scene graph in accordance with one or more embodiments;
FIG. 13 illustrates a scene-based image editing system generating a semantic scene graph for a digital image in accordance with one or more embodiments;
FIG. 14 shows an illustration of generating a semantic scene graph for a digital image utilizing a template graph in accordance with one or more embodiments;
FIG. 15 shows another illustration of generating a semantic scene graph for a digital image in accordance with one or more embodiments;
FIG. 16 illustrates an overview of a multi-attribute contrastive classification neural network in accordance with one or more embodiments;
FIG. 17 illustrates an architecture of a multi-attribute contrastive classification neural network in accordance with one or more embodiments;
FIG. 18 illustrates an attribute-modifying neural network used by a scene-based image editing system to modify object attributes in accordance with one or more embodiments;
FIGS. 19A-19C illustrate a graphical user interface implemented by a scene-based image editing system to facilitate modifying object properties of objects depicted in a digital image in accordance with one or more embodiments;
FIGS. 20A-20C illustrate another graphical user interface implemented by a scene-based image editing system to facilitate modifying object properties of objects depicted in a digital image in accordance with one or more embodiments;
FIGS. 21A-21C illustrate yet another graphical user interface implemented by a scene-based image editing system to facilitate modifying object properties of objects depicted in a digital image in accordance with one or more embodiments;
FIGS. 22A-22D illustrate a graphical user interface implemented by a scene-based image editing system to facilitate relationship-aware object modification in accordance with one or more embodiments;
FIGS. 23A-23C illustrate another graphical user interface implemented by a scene-based image editing system to facilitate relationship-aware object modification in accordance with one or more embodiments;
FIGS. 24A-24C illustrate yet another graphical user interface implemented by a scene-based image editing system to facilitate relationship-aware object modification in accordance with one or more embodiments;
FIGS. 25A-25D illustrate a graphical user interface implemented by a scene-based image editing system to add objects to a selection for modification based on a classification relationship in accordance with one or more embodiments;
FIG. 26 illustrates a neural network pipeline used by a scene-based image editing system to identify and remove interfering objects from digital images in accordance with one or more embodiments;
FIG. 27 illustrates an architecture of an interferent detection neural network used by a scene-based image editing system to identify and classify interfering objects in a digital image in accordance with one or more embodiments;
FIG. 28 illustrates an architecture of a heatmap network used as part of an interferent detection neural network by a scene-based image editing system in accordance with one or more embodiments;
FIG. 29 illustrates an architecture of a hybrid classifier used as part of an interferent detection neural network by a scene-based image editing system in accordance with one or more embodiments;
FIGS. 30A-30C illustrate a graphical user interface implemented by a scene-based image editing system to identify and remove interfering objects from digital images in accordance with one or more embodiments;
FIGS. 31A-31C illustrate another graphical user interface implemented by a scene-based image editing system to identify and remove interfering objects from digital images in accordance with one or more embodiments;
FIGS. 32A-32B illustrate a scene-based image editing system utilizing intelligent dilation to remove objects from a digital image in accordance with one or more embodiments;
FIG. 33 illustrates an overview of a shadow detection neural network in accordance with one or more embodiments;
FIG. 34 illustrates an overview of an example segmentation component of a shadow detection neural network in accordance with one or more embodiments;
FIG. 35 illustrates an overview of an object perception component of a shadow detection neural network in accordance with one or more embodiments;
FIG. 36 illustrates an overview of a shadow prediction component of a shadow detection neural network in accordance with one or more embodiments;
FIG. 37 illustrates an overview of an architecture of a shadow detection neural network in accordance with one or more embodiments;
FIG. 38 illustrates a diagram for determining shadows associated with an object depicted in a digital image using a second stage of a shadow detection neural network, in accordance with one or more embodiments;
FIGS. 39A-39C illustrate a graphical user interface implemented by a scene-based image editing system to identify and remove shadows of objects depicted in a digital image in accordance with one or more embodiments;
FIG. 40 illustrates determining a vanishing point and horizon line for a digital image in accordance with one or more embodiments;
FIG. 41 illustrates utilizing the perspective of a digital image to determine how to resize an object moved relative to the perspective in accordance with one or more embodiments;
FIGS. 42A-42C illustrate a graphical user interface implemented by a scene-based image editing system to perform perspective-aware object movement operations on objects in accordance with one or more embodiments;
FIG. 43 illustrates a graphical user interface used by a scene-based image editing system to provide multiple perspective-based dimensional previews of objects depicted within a digital image in accordance with one or more embodiments;
FIG. 44 illustrates determining an object depth of an object depicted within a digital image in accordance with one or more embodiments;
FIGS. 45A-45C illustrate a graphical user interface used by a scene-based image editing system to reflect occlusions between objects within a digital image in accordance with one or more embodiments;
FIG. 46 illustrates object depths of objects depicted in a digital image in accordance with one or more embodiments;
FIGS. 47A-47C illustrate a graphical user interface implemented by a scene-based image editing system to perform depth-aware object movement operations in accordance with one or more embodiments;
FIGS. 48A-48C illustrate another graphical user interface implemented by a scene-based image editing system to perform depth-aware object movement operations in accordance with one or more embodiments;
FIG. 49 illustrates an overview of a scene-based image editing system that performs fill modifications in accordance with one or more embodiments;
FIG. 50A illustrates an example diagram of a scene-based image editing system that trains a diffusion neural network to generate a filled semantic map in accordance with one or more embodiments;
FIG. 50B illustrates an example diagram of a scene-based image editing system utilizing a diffusion neural network to generate a filled semantic map in accordance with one or more embodiments;
FIG. 51 illustrates an example diagram of a scene-based image editing system utilizing a diffusion neural network to generate a modified digital image in accordance with one or more embodiments;
FIG. 52 illustrates an example diagram of a scene-based image editing system adjusting a diffusion neural network with input textures in accordance with one or more embodiments;
FIG. 53 illustrates an example diagram of a scene-based image editing system that segments a portion of a digital image for replacement in accordance with one or more embodiments;
FIGS. 54A-54C illustrate an example graphical user interface of a scene-based image editing system that provides parameters for generating a modified digital image in accordance with one or more embodiments;
FIGS. 55A-55B illustrate an example graphical user interface of a scene-based image editing system that removes objects and generates modified digital images in accordance with one or more embodiments;
FIG. 56 illustrates an overview of a scene-based image editing system that performs human repair in accordance with one or more embodiments;
FIG. 57 illustrates an example diagram of a scene-based image editing system that generates encoding in accordance with one or more embodiments;
FIG. 58 illustrates an example diagram of a scene-based image editing system that generates local appearance feature tensors in accordance with one or more embodiments;
FIG. 59 illustrates an example diagram of a scene-based image editing system utilizing a layered encoder to generate structural encoding in accordance with one or more embodiments;
FIG. 60 illustrates an example diagram of a scene-based image editing system utilizing a layered encoder to generate local appearance feature tensors in accordance with one or more embodiments;
FIG. 61 illustrates an example diagram of a modulation pattern block of a scene-based image editing system in accordance with one or more embodiments;
FIG. 62 illustrates an example diagram of a scene-based image editing system utilizing separate segmentation map branches in accordance with one or more embodiments;
FIG. 63 illustrates an example diagram of a scene-based image editing system utilizing a background GAN and a human repair GAN to generate a modified digital image in accordance with one or more embodiments;
FIG. 64 illustrates an example diagram of a scene-based image editing system that trains a human repair GAN in accordance with one or more embodiments;
FIG. 65 illustrates an example schematic diagram of a scene-based image editing system in accordance with one or more embodiments;
FIG. 66 illustrates a flow diagram of a series of acts for implementing perspective-aware object movement operations on a digital image in accordance with one or more embodiments;
FIG. 67 illustrates a flow diagram of a series of acts for implementing depth-aware object movement operations on a digital image in accordance with one or more embodiments; and
FIG. 68 illustrates a block diagram of an exemplary computing device in accordance with one or more embodiments.
Detailed Description
One or more embodiments described herein include a scene-based image editing system that uses intelligent image understanding to implement scene-based image editing techniques. Indeed, in one or more embodiments, the scene-based image editing system utilizes one or more machine learning models to process the digital image in anticipation of user interaction for modifying the digital image. For example, in some implementations, a scene-based image editing system performs operations that construct a knowledge set for a digital image and/or automatically initiate a workflow for certain modifications before receiving user input for those modifications. Based on this preprocessing, the scene-based image editing system facilitates user interaction with the digital image as if it were a real scene reflecting real world conditions. For example, a scene-based image editing system enables user interaction to target pre-processed semantic regions (e.g., objects that have been identified and/or masked via pre-processing) as distinct components for editing rather than individual underlying pixels. Further, the scene-based image editing system automatically modifies the digital image to consistently reflect the corresponding real world conditions.
As described above, in one or more embodiments, a scene-based image editing system utilizes machine learning to process digital images to anticipate future modifications. Specifically, in some cases, the scene-based image editing system uses one or more machine learning models to perform preparatory operations that will facilitate subsequent modifications. In some embodiments, the scene-based image editing system automatically performs preprocessing in response to receiving the digital image. For example, in some implementations, a scene-based image editing system collects data and/or initiates a workflow for editing digital images prior to receiving user input for such editing. Thus, scene-based image editing systems allow user interaction to directly indicate the desired edits to a digital image, rather than the various preparatory steps often used to make such edits.
As an example, in one or more embodiments, a scene-based image editing system pre-processes digital images to facilitate object-aware modification. In particular, in some embodiments, the scene-based image editing system pre-processes the digital image to anticipate user input for manipulating one or more semantic regions of the digital image, such as user input for moving or deleting one or more objects within the digital image.
To illustrate, in some cases, the scene-based image editing system utilizes a segmentation neural network to generate an object mask for each object depicted in a digital image. In some cases, the scene-based image editing system utilizes a hole-filling model to generate a content fill (e.g., an inpainted segment) for each object (e.g., for each corresponding object mask). In some implementations, the scene-based image editing system generates a completed background for the digital image by pre-filling the object holes with the corresponding content fills. Thus, in one or more embodiments, a scene-based image editing system pre-processes digital images to prepare for object-aware modifications, such as move operations or delete operations, by pre-generating object masks and/or content fills before user input for such modifications is received.
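For illustration only, the following Python sketch outlines the kind of preprocessing pipeline described above; the model interfaces (`segmentation_model.segment`, `hole_filling_model.fill`) and the `PreprocessedImage` structure are hypothetical placeholders rather than the actual networks used by the scene-based image editing system.

```python
# Hypothetical sketch of the preprocessing described above: segment each object,
# pre-generate a content fill behind it, and cache both for later edits.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class PreprocessedImage:
    image: np.ndarray                                    # H x W x 3 source pixels
    object_masks: dict = field(default_factory=dict)     # object id -> H x W bool mask
    content_fills: dict = field(default_factory=dict)    # object id -> H x W x 3 fill pixels


def preprocess(image, segmentation_model, hole_filling_model):
    """Run segmentation and hole filling before any edit is requested."""
    result = PreprocessedImage(image=image)
    # 1. Generate an object mask for every depicted object/semantic region.
    for object_id, mask in segmentation_model.segment(image).items():
        result.object_masks[object_id] = mask
        # 2. Generate a content fill for the region behind the object.
        result.content_fills[object_id] = hole_filling_model.fill(image, mask)
    return result
```

Under this assumption, the object masks and content fills are cached so that a later move or delete operation can be completed without re-running the models.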
Thus, upon receiving one or more user inputs for an object-aware modification (e.g., a move operation or a delete operation) of an object of a digital image, the scene-based image editing system completes the modification with the corresponding pre-generated object mask and/or content fill. For example, in some cases, a scene-based image editing system detects a user interaction (e.g., a user selection of an object) with an object depicted in a digital image via a graphical user interface that displays the digital image. In response to the user interaction, the scene-based image editing system presents the previously generated corresponding object mask. The scene-based image editing system also detects a second user interaction with the object (e.g., with the presented object mask) for moving or deleting the object via the graphical user interface. Accordingly, the scene-based image editing system moves or deletes the object, revealing the content fill previously generated for the area behind the object.
Further, in one or more embodiments, a scene-based image editing system pre-processes a digital image to generate a semantic scene graph for the digital image. In particular, in some embodiments, a scene-based image editing system generates a semantic scene graph to map various characteristics of a digital image. For example, in some cases, a scene-based image editing system generates a semantic scene graph that describes objects depicted in a digital image, relationships or object properties of those objects, and/or various other characteristics determined to be available for subsequent modification of the digital image.
In some cases, the scene-based image editing system utilizes one or more machine learning models to determine characteristics of the digital image to be included in the semantic scene graph. Further, in some cases, the scene-based image editing system utilizes one or more predetermined or pre-generated template graphs to generate the semantic scene graph. For example, in some embodiments, the scene-based image editing system utilizes an image analysis graph, a real-world category description graph, and/or a behavior policy graph in generating the semantic scene graph.
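As a purely illustrative aid, the sketch below shows one plausible in-memory structure for a semantic scene graph, with nodes holding object attributes and edges holding relationships; the field names and example labels are assumptions rather than the system's actual schema.

```python
# Illustrative (assumed) structure for a semantic scene graph: nodes hold object
# attributes, edges hold relationships between objects.
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    object_id: str
    category: str                                      # e.g., "person", "sky"
    attributes: dict = field(default_factory=dict)     # e.g., {"color": "red", "size": "large"}


@dataclass
class Relationship:
    source: str                                        # object id
    target: str                                        # object id
    label: str                                         # e.g., "holding", "wearing"


@dataclass
class SemanticSceneGraph:
    nodes: dict = field(default_factory=dict)          # object id -> ObjectNode
    edges: list = field(default_factory=list)          # list of Relationship

    def related_objects(self, object_id):
        """Return the ids of objects related to the given object."""
        related = set()
        for edge in self.edges:
            if edge.source == object_id:
                related.add(edge.target)
            elif edge.target == object_id:
                related.add(edge.source)
        return related
```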
Thus, in some cases, the scene-based image editing system facilitates modification of digital images using semantic scene graphs generated for the digital images. For example, in some embodiments, upon determining that an object has been selected for modification, the scene-based image editing system retrieves characteristics of the object from the semantic scene graph to facilitate the modification. To illustrate, in some implementations, a scene-based image editing system performs or suggests one or more additional modifications to a digital image based on characteristics from a semantic scene graph.
As one example, in some embodiments, upon determining that an object has been selected for modification, a scene-based image editing system provides one or more object properties of the object for display via a graphical user interface that displays the object. For example, in some cases, a scene-based image editing system retrieves a set of object properties (e.g., size, shape, or color) of an object from a corresponding semantic scene graph and presents the set of object properties for display in association with the object.
In some cases, the scene-based image editing system also facilitates user interaction with the displayed set of object properties to modify one or more object properties. For example, in some embodiments, the scene-based image editing system allows user interactions that change the text of a displayed object property or select from a provided set of alternative object property values. Based on the user interaction, the scene-based image editing system modifies the digital image by modifying the one or more object properties in accordance with the user interaction.
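Continuing the illustrative scene-graph sketch above, the following hedged example shows how such an attribute edit might flow: the object's properties are read from the graph, the change is applied by a placeholder attribute modification model, and the graph is updated to stay in sync. The `attribute_model.modify` interface is an assumption.

```python
# Assumed flow for an attribute edit: read the object's properties from the
# semantic scene graph, apply the change with a placeholder attribute
# modification model, and keep the graph in sync with the edited image.
def modify_object_attribute(image, scene_graph, object_id, attribute, new_value,
                            attribute_model):
    node = scene_graph.nodes[object_id]
    previous_value = node.attributes.get(attribute)    # e.g., the color being replaced
    modified_image = attribute_model.modify(image, node, attribute, new_value)
    node.attributes[attribute] = new_value             # update the semantic scene graph
    return modified_image, previous_value
```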
As another example, in some implementations, a scene-based image editing system utilizes a semantic scene graph to implement relationship-aware object modification. To illustrate, in some cases, a scene-based image editing system detects a user interaction that selects an object depicted in a digital image for modification. The scene-based image editing system references a semantic scene graph previously generated for the digital image to identify a relationship between the object and one or more other objects depicted in the digital image. Based on the identified relationships, the scene-based image editing system also targets one or more related objects for modification.
For example, in some cases, the scene-based image editing system automatically adds one or more related objects to the user selection. In some cases, the scene-based image editing system provides a suggestion to include one or more related objects in the user selection and adds one or more related objects based on acceptance of the suggestion. Thus, in some embodiments, the scene-based image editing system modifies one or more related objects as it modifies the user-selected object.
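Reusing the illustrative scene-graph sketch above, a minimal version of this relationship-aware selection expansion might look like the following; the relationship labels listed are examples, not an exhaustive or authoritative set.

```python
# Assumed sketch of relationship-aware selection: when a user selects an object,
# consult the semantic scene graph and suggest related objects to include.
def expand_selection(scene_graph, selected_id,
                     relationship_labels=("holding", "wearing")):
    """Return object ids suggested to be modified along with the selected object."""
    suggested = set()
    for edge in scene_graph.edges:
        if edge.label not in relationship_labels:
            continue
        if edge.source == selected_id:
            suggested.add(edge.target)
        elif edge.target == selected_id:
            suggested.add(edge.source)
    return suggested
```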
In one or more embodiments, in addition to preprocessing the digital image to identify depicted objects and their relationships and/or object properties, the scene-based image editing system also preprocesses the digital image to help remove interfering objects. For example, in some cases, a scene-based image editing system utilizes an interferent detection neural network to classify one or more objects depicted in a digital image as subjects of the digital image and/or to classify one or more other objects depicted in the digital image as interfering objects. In some embodiments, the scene-based image editing system provides visual indications of interfering objects within the display of the digital image suggesting removal of these objects to present a more aesthetic and consistent visual result.
Furthermore, in some cases, the scene-based image editing system detects shadows of interfering objects (or other selected objects) for removal with the interfering objects. In particular, in some cases, a scene-based image editing system utilizes a shadow detection neural network to identify shadows depicted in a digital image and associate the shadows with their corresponding objects. Thus, upon removal of the interfering object from the digital image, the scene-based image editing system also automatically removes the associated shadow.
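As a hedged sketch of this behavior, the example below pairs each object classified as an interfering object with its detected shadow and removes both, revealing a content fill; the `distractor_classifier`, `shadow_detector`, and `hole_filling_model` interfaces, as well as the `preprocessed.object_masks` structure from the earlier sketch, are hypothetical placeholders.

```python
# Assumed sketch: classify objects as subjects or interfering objects, pair each
# interfering object with its detected shadow, and remove both together.
def remove_distractors(image, preprocessed, distractor_classifier, shadow_detector,
                       hole_filling_model):
    result = image.copy()
    for object_id, mask in preprocessed.object_masks.items():
        if not distractor_classifier.is_distractor(image, mask):
            continue                                   # keep the subjects of the image
        removal_mask = mask.copy()
        shadow_mask = shadow_detector.shadow_for(image, mask)
        if shadow_mask is not None:
            removal_mask |= shadow_mask                # remove the shadow with its object
        fill = hole_filling_model.fill(image, removal_mask)
        result[removal_mask] = fill[removal_mask]      # reveal the content fill
    return result
```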
In some embodiments, the scene-based image editing system uses three-dimensional effects to modify objects within the digital image. For example, in some cases, a scene-based image editing system moves objects within a digital image relative to a perspective associated with the digital image. In some cases, the scene-based image editing system also adjusts the size of the object based on the movement. In some cases, the scene-based image editing system provides occlusion for objects that have been moved to overlap with another object within the digital image based on an object depth determined for the object (e.g., determined via preprocessing).
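One plausible way to realize the perspective-based resizing mentioned above is to scale an object in proportion to its distance from the vanishing point, as in the sketch below; this heuristic is an assumption for illustration and may differ from the exact scaling the system applies.

```python
# A minimal sketch of perspective-based resizing, assuming apparent size scales
# with an object's distance from the vanishing point (an illustrative heuristic,
# not necessarily the system's exact formulation).
import math


def perspective_scale(anchor_xy, new_anchor_xy, vanishing_point_xy):
    """Scale factor for an object moved from anchor_xy to new_anchor_xy."""
    def distance(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    old_distance = distance(anchor_xy, vanishing_point_xy)
    new_distance = distance(new_anchor_xy, vanishing_point_xy)
    if old_distance == 0:
        return 1.0
    return new_distance / old_distance


# Example: moving an object toward the vanishing point yields a factor below 1,
# so the object is rendered smaller after the move.
scale = perspective_scale(anchor_xy=(400, 900), new_anchor_xy=(420, 700),
                          vanishing_point_xy=(500, 300))
print(f"resize factor: {scale:.2f}")
```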
Additionally, in some implementations, the scene-based image editing system tracks semantic histories of digital images that have been edited. For example, in some cases, a scene-based image editing system generates and maintains a semantic history log that reflects the semantic status of digital images produced by various edits. In some cases, the scene-based image editing system also facilitates interactions with the semantic history log (e.g., via a graphical user interface) to enable a user to view and/or modify previous semantic states.
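A minimal sketch of such a semantic history log, assuming each entry simply snapshots the semantic scene graph together with a human-readable description of the edit, might look like the following; the structure is illustrative rather than the system's actual log format.

```python
# Assumed sketch of a semantic history log: each entry records the semantic state
# produced by an edit so that earlier states can be viewed or restored.
import copy
from dataclasses import dataclass, field


@dataclass
class SemanticHistoryLog:
    states: list = field(default_factory=list)

    def record(self, scene_graph, description):
        """Snapshot the semantic state after an edit (e.g., 'deleted object 208c')."""
        self.states.append({"description": description,
                            "scene_graph": copy.deepcopy(scene_graph)})

    def state_at(self, index):
        """Return a previously recorded semantic state for viewing or restoring."""
        return self.states[index]
```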
In some embodiments, the scene-based image editing system utilizes multimodal interactions to modify the digital image. To illustrate, in some embodiments, a scene-based image editing system uses voice input and gesture interactions (e.g., touch interactions with a touch screen of a client device) in editing a digital image. Based on the two received inputs, the scene-based image editing system determines a target edit of the digital image.
Scene-based image editing systems offer advantages over conventional systems. In fact, conventional image editing systems suffer from several technical drawbacks, resulting in inflexible and inefficient operation. To illustrate, conventional systems are often inflexible in that they strictly perform editing on digital images at the pixel level. In particular, conventional systems typically perform a particular edit by targeting pixels individually for the edit. Thus, such systems often strictly require user interaction for editing a digital image to interact with a single pixel to indicate the region for editing. Furthermore, many conventional systems (e.g., due to their pixel-based editing) require the user to have a great deal of in-depth expertise in how to interact with the digital image and the user interface of the system itself, to select the desired pixels and to perform the appropriate workflow to edit those pixels.
Furthermore, conventional image editing systems often do not function efficiently. For example, conventional systems typically require a significant amount of user interaction to modify the digital image. In fact, conventional systems typically require a user to interact with multiple menus, sub-menus, and/or windows to perform editing, in addition to user interaction for selecting a single pixel. For example, many edits may require multiple editing steps to be performed using multiple different tools. Thus, many conventional systems require multiple interactions to select the appropriate tool at a given editing step, set the required parameters for the tool, and utilize the tool to perform the editing step.
The scene-based image editing system has improved flexibility compared to conventional systems. In particular, the scene-based image editing system implements techniques that facilitate flexible scene-based editing. For example, by preprocessing a digital image via machine learning, the scene-based image editing system allows the digital image to be edited as if it were a real scene, where the various elements of the scene are known and can be interacted with on a semantic level to perform edits that continuously reflect real world conditions. Indeed, where a pixel is the target unit under many conventional systems and an object is typically considered merely a group of pixels, the scene-based image editing system allows a user to treat an entire semantic region (e.g., an object) as a distinct unit. Furthermore, where conventional systems typically require in-depth, specialized knowledge of the tools and workflows required to perform an edit, the scene-based image editing system provides a more intuitive editing experience that enables users to focus on the final goal of the edit.
Furthermore, scene-based image editing systems have higher efficiency than conventional systems. In particular, the scene-based image editing system implements a graphical user interface that reduces the user interaction required for editing. In effect, by preprocessing the digital image for desired editing, the scene-based image editing system reduces the user interaction required to perform editing. In particular, a scene-based image editing system performs many of the operations required for editing, without relying on user instructions to perform those operations. Thus, in many cases, scene-based image editing systems reduce the user interaction typically required in conventional systems to select pixels to be the editing target and navigate menus, submenus, or other windows to select a tool, select its corresponding parameters, and apply the tool to perform editing. By implementing a graphical user interface that reduces and simplifies the user interaction required to edit digital images, a scene-based image editing system provides an improved user experience on computing devices (e.g., tablet or smartphone devices) that have relatively limited screen space.
Additional details regarding the scene-based image editing system will now be provided with reference to the accompanying drawings. For example, FIG. 1 shows a schematic diagram of an exemplary system 100 in which a scene-based image editing system 106 operates. As shown in FIG. 1, system 100 includes server(s) 102, network 108, and client devices 110a-110n.
Although the system 100 of fig. 1 is depicted as having a particular number of components, the system 100 can have any number of additional or alternative components (e.g., any number of servers, client devices, or other components in communication with the scene-based image editing system 106 via the network 108). Similarly, while FIG. 1 shows a particular arrangement of server(s) 102, network 108, and client devices 110a-110n, various additional arrangements are possible.
Server(s) 102, network 108, and client devices 110a-110n are communicatively coupled to each other either directly or indirectly (e.g., through network 108, discussed in more detail below in connection with FIG. 68). Further, server(s) 102 and client devices 110a-110n include one or more of a variety of computing devices (including one or more of the computing devices discussed in more detail with reference to FIG. 68).
As described above, the system 100 includes the server(s) 102. In one or more embodiments, server(s) 102 generate, store, receive, and/or transmit data including digital images and modified digital images. In one or more embodiments, server(s) 102 include a data server. In some implementations, the server(s) 102 include a communication server or a web hosting server.
In one or more embodiments, the image editing system 104 provides functionality for a client device (e.g., a user of one of the client devices 110a-110 n) to generate, edit, manage, and/or store digital images. For example, in some cases, the client device transmits the digital image to the image editing system 104 hosted on the server(s) 102 via the network 108. The image editing system 104 then provides options that the client device can use to edit the digital image, store the digital image, and then search for, access, and view the digital image. For example, in some cases, the image editing system 104 provides one or more options that a client device may use to modify objects within a digital image.
In one or more embodiments, client devices 110a-110n include computing devices that access, view, modify, store, and/or provide digital images for display. For example, client devices 110a-110n include smart phones, tablets, desktop computers, laptop computers, head mounted display devices, or other electronic devices. Client devices 110a-110n include one or more applications (e.g., client application 112) that may access, view, modify, store, and/or provide digital images for display. For example, in one or more embodiments, client application 112 includes a software application installed on client devices 110a-110 n. Additionally or alternatively, client application 112 includes a web browser or other application that accesses software applications hosted on server(s) 102 (and supported by image editing system 104).
To provide an example implementation, in some embodiments, the scene-based image editing system 106 on the server(s) 102 supports the scene-based image editing system 106 on the client device 110n. For example, in some cases, the scene-based image editing system 106 on the server(s) 102 learns parameters of the neural network(s) 114 for analyzing and/or modifying the digital image. The scene-based image editing system 106 then provides the neural network(s) 114 to the client device 110n via the server(s) 102. In other words, the client device 110n obtains (e.g., downloads) the neural network(s) 114 with the learned parameters from the server(s) 102. Once downloaded, the scene-based image editing system 106 on the client device 110n utilizes the neural network(s) 114 to analyze and/or modify the digital image independent of the server(s) 102.
In an alternative implementation, the scene-based image editing system 106 includes a web hosting application that allows the client device 110n to interact with content and services hosted on the server(s) 102. To illustrate, in one or more implementations, the client device 110n accesses a software application supported by the server(s) 102. In response, the scene-based image editing system 106 on the server(s) 102 modifies the digital image. Server(s) 102 then provide the modified digital image to client device 110n for display.
Indeed, the scene-based image editing system 106 can be implemented in whole or in part by the various elements of the system 100. Indeed, while FIG. 1 shows scene-based image editing system 106 implemented with respect to server(s) 102, the different components of scene-based image editing system 106 can be implemented by various devices within system 100. For example, one or more (or all) of the components of the scene-based image editing system 106 are implemented by a different computing device (e.g., one of the client devices 110a-110 n) or a separate server than the server(s) 102 hosting the image editing system 104. In fact, as shown in FIG. 1, client devices 110a-110n include scene-based image editing system 106. Example components of scene-based image editing system 106 are described below with reference to fig. 65.
As described above, in one or more embodiments, the scene-based image editing system 106 manages two-dimensional digital images as real scenes reflecting real world conditions. In particular, the scene-based image editing system 106 implements a graphical user interface that facilitates modifying digital images as real scenes. FIG. 2 illustrates an overview of the scene-based image editing system 106 managing a digital image as a real scene in accordance with one or more embodiments.
As shown in fig. 2, scene-based image editing system 106 provides a graphical user interface 202 for display on a client device 204. As further shown, the scene-based image editing system 106 provides a digital image 206 for display within the graphical user interface 202. In one or more embodiments, after capturing the digital image 206 via the camera of the client device 204, the scene-based image editing system 106 provides the digital image 206 for display. In some cases, the scene-based image editing system 106 receives the digital image 206 from another computing device or otherwise accesses the digital image 206 at some storage location, either local or remote.
As shown in fig. 2, the digital image 206 depicts various objects. In one or more embodiments, an object includes a distinct visual component depicted in a digital image. In particular, in some embodiments, an object includes a distinct visual element that is identifiable separately from other visual elements depicted in the digital image. In many cases, an object includes a set of pixels that together depict a distinct visual element separate from the depiction of other pixels. An object refers to a visual representation of a subject, concept, or sub-concept in an image. In particular, an object refers to a group of pixels in an image that combine together to form a visual depiction of an item, article, portion of an item, component, or element. In some cases, objects are identifiable at different levels of abstraction. In other words, in some cases, an object includes individual object components that can be identified individually or as part of an aggregate. To illustrate, in some embodiments, an object includes a semantic region (e.g., sky, ground, water, etc.). In some embodiments, an object includes an instance of an identifiable thing (e.g., a person, an animal, a building, an automobile, a cloud, an article of clothing, or some other accessory). In one or more embodiments, an object includes a sub-object, component, or portion. For example, a person's face, hair, or leg can be an object that is part of another object (e.g., the person's body). In yet another implementation, a shadow or reflection includes a portion of an object. As another example, a shirt is an object that can be part of another object (e.g., a person).
As shown in fig. 2, the digital image 206 depicts a static two-dimensional image. In particular, the digital image 206 depicts a two-dimensional projection of a scene captured from the perspective of the camera. Thus, the digital image 206 reflects the conditions under which the image was captured (e.g., the lighting, the surrounding environment, or the physical conditions to which the depicted objects are subject); however, it does so statically. In other words, those conditions are not inherently maintained when changes are made to the digital image 206. Under many conventional systems, additional user interaction is required to maintain consistency with these conditions when editing a digital image.
Further, the digital image 206 includes a plurality of individual pixels that collectively depict various semantic regions. For example, the digital image 206 depicts a plurality of objects, such as objects 208a-208c. Although the pixels of each object together contribute to the depiction of a cohesive visual unit, they are not typically treated as such. In practice, a pixel of a digital image is typically treated as an individual unit with its own value (e.g., color value), which can be modified separately from the values of other pixels. Thus, conventional systems typically require user interactions to individually target pixels for modification when making changes to a digital image.
However, as shown in FIG. 2, the scene-based image editing system 106 manages the digital image 206 as a real scene, consistently maintaining the conditions under which the image was captured as the digital image is modified. In particular, the scene-based image editing system 106 automatically maintains those conditions without relying on user input to reflect them. In addition, the scene-based image editing system 106 manages the digital image 206 at a semantic level. In other words, the scene-based image editing system 106 manages each semantic region depicted in the digital image 206 as a coherent unit. For example, as shown in FIG. 2 and discussed below, no user interaction is required to select the underlying pixels in order to interact with the corresponding object; rather, the scene-based image editing system 106 enables user input to target the object as a unit, and the scene-based image editing system 106 automatically identifies the pixels associated with the object.
To illustrate, as shown in fig. 2, in some cases, the scene-based image editing system 106 operates on a computing device 200 (e.g., a client device 204 or a separate computing device, such as the server(s) 102 discussed above with reference to fig. 1) to pre-process the digital image 206. In particular, the scene-based image editing system 106 performs one or more preprocessing operations to anticipate future modifications to the digital image. In one or more embodiments, the scene-based image editing system 106 automatically performs these preprocessing operations in response to receiving or accessing the digital image 206 before user input for making the desired modification has been received. As further shown, the scene-based image editing system 106 utilizes one or more machine learning models, such as neural network(s) 114, to perform preprocessing operations.
In one or more embodiments, the scene-based image editing system 106 pre-processes the digital image 206 by learning characteristics of the digital image 206. For example, in some cases, the scene-based image editing system 106 segments the digital image 206, identifies objects, classifies objects, determines relationships and/or properties of objects, determines illumination characteristics, and/or determines depth/perspective characteristics. In some embodiments, the scene-based image editing system 106 pre-processes the digital image 206 by generating content for modifying the digital image 206. For example, in some implementations, the scene-based image editing system 106 generates an object mask for each depicted object and/or generates a content fill for filling in the background behind each depicted object. The background refers to the content behind an object in an image. Thus, when a first object is positioned in front of a second object, the second object forms at least part of the background of the first object. Alternatively, the background comprises the furthest elements in the image (typically semantic regions such as sky, ground, water, etc.). In one or more embodiments, the background of an object includes a plurality of objects/semantic regions. For example, the background of an object may include a portion of another object and a portion of the furthest element in the image. Various preprocessing operations and their use in modifying digital images will be discussed in more detail below with reference to subsequent figures.
As shown in FIG. 2, the scene-based image editing system 106 detects, via the graphical user interface 202, a user interaction with the object 208c. In particular, the scene-based image editing system 106 detects a user interaction for selecting the object 208c. Indeed, in one or more embodiments, the scene-based image editing system 106 determines that the user interaction targets the object even where the user interaction interacts with only a subset of the pixels contributing to the object 208c, based on the pre-processing of the digital image 206. For example, as described above, in some embodiments, the scene-based image editing system 106 pre-processes the digital image 206 via segmentation. Thus, upon detecting the user interaction, the scene-based image editing system 106 has already partitioned/segmented the digital image 206 into its various semantic regions. Accordingly, in some cases, the scene-based image editing system 106 determines that the user interaction selects the distinct semantic region (e.g., the object 208c) rather than the particular underlying pixels or image layer with which the user interacted.
As further shown in fig. 2, scene-based image editing system 106 modifies digital image 206 via modification to object 208c. Although FIG. 2 illustrates the deletion of object 208c, various modifications are possible and will be discussed in more detail below. In some embodiments, scene-based image editing system 106 edits object 208c in response to detecting a second user interaction to perform the modification.
As shown, upon deleting the object 208c from the digital image 206, the scene-based image editing system 106 automatically displays the background pixels that take the place of the object 208c. Indeed, as described above, in some embodiments, the scene-based image editing system 106 pre-processes the digital image 206 by generating a content fill for each depicted foreground object. Thus, as shown in FIG. 2, when the object 208c is removed from the digital image 206, the scene-based image editing system 106 automatically exposes the content fill 210 previously generated for the object 208c. In some cases, the scene-based image editing system 106 positions the content fill 210 within the digital image such that, when the object 208c is removed, the content fill 210 is exposed rather than a hole.
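The compositing step can be illustrated with synthetic data: where the deleted object's mask is true, the pre-generated content fill is shown instead of a hole. The arrays below are merely stand-ins for the digital image 206, the content fill 210, and the object mask for the object 208c.

```python
# Synthetic illustration of revealing a pre-generated content fill: where the
# deleted object's mask is True, the fill is shown instead of a hole.
import numpy as np

image = np.zeros((4, 4, 3), dtype=np.uint8)              # stand-in for digital image 206
content_fill = np.full((4, 4, 3), 200, dtype=np.uint8)   # stand-in for content fill 210
object_mask = np.zeros((4, 4), dtype=bool)
object_mask[1:3, 1:3] = True                             # pixels covered by object 208c

modified = image.copy()
modified[object_mask] = content_fill[object_mask]        # deleting the object exposes the fill
print(modified[:, :, 0])
```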
Thus, the scene-based image editing system 106 has improved flexibility compared to many conventional systems. In particular, the scene-based image editing system 106 implements flexible scene-based editing techniques in which digital images are modified as real scenes that maintain real world conditions (e.g., physics, environment, or object relationships). Indeed, in the example shown in fig. 2, the scene-based image editing system 106 utilizes the pre-generated content fill to consistently maintain the background depicted in the digital image 206 as if the digital image 206 had captured the entirety of that background. Thus, the scene-based image editing system 106 enables depicted objects to be freely moved around (or removed entirely) without disrupting the scene depicted therein.
Furthermore, the scene-based image editing system 106 operates with improved efficiency. In effect, by segmenting digital image 206 and generating content pad 210 in anticipation of modifications to remove object 208c from its location in digital image 206, scene-based image editing system 106 reduces user interactions that would normally be required to perform those same operations under conventional systems. Thus, the scene-based image editing system 106 is able to implement the same modification to the digital image with less user interaction than these conventional systems.
As described above, in one or more embodiments, the scene-based image editing system 106 implements object-aware image editing on digital images. In particular, the scene-based image editing system 106 implements object-aware modifications that treat objects as interactable and modifiable coherent units. FIGS. 3-9C illustrate the scene-based image editing system 106 implementing object-aware modifications in accordance with one or more embodiments.
In fact, many conventional image editing systems are inflexible and inefficient in interacting with objects depicted in digital images. For example, as previously mentioned, conventional systems are typically inflexible in that they require user interaction to target separate pixels, rather than objects depicted by those pixels. Thus, such systems typically require a strict and careful procedure to select the pixels to be modified. Furthermore, when objects are identified via user selection, these systems often fail to predict and prepare for potential editing of these objects.
In addition, many conventional image editing systems require a significant amount of user interaction to modify objects depicted in the digital image. Indeed, in addition to the pixel selection process for identifying objects in a digital image, which may itself require a series of user interactions, conventional systems may require a considerable workflow in which a user interacts with multiple menus, sub-menus, tools and/or windows to perform editing. Typically, performing editing on an object requires multiple preparatory steps before the required editing can be performed, which requires additional user interaction.
Scene-based image editing system 106 provides advantages over these systems. For example, the scene-based image editing system 106 provides improved flexibility in image editing via object awareness. In particular, the scene-based image editing system 106 implements object-level interactions, rather than pixel-level or layer-level interactions, to facilitate user interactions that directly target depicted objects as coherent units rather than their separate constituent pixels.
Furthermore, the scene-based image editing system 106 improves the efficiency of interaction with objects depicted in the digital image. In fact, as previously described, and as will be discussed further below, the scene-based image editing system 106 implements preprocessing operations for identifying and/or segmenting depicted objects in anticipation of modifications to those objects. Indeed, in many cases, the scene-based image editing system 106 performs these preprocessing operations without receiving user interactions for those modifications. Thus, the scene-based image editing system 106 reduces the user interaction required to perform a given edit on the depicted object.
In some embodiments, the scene-based image editing system 106 implements object-aware image editing by generating an object mask for each object/semantic region depicted in a digital image. Specifically, in some cases, the scene-based image editing system 106 utilizes a machine learning model, such as a segmentation neural network, to generate the object mask(s). FIG. 3 illustrates a segmentation neural network used by the scene-based image editing system 106 to generate object masks for objects in accordance with one or more embodiments.
In one or more embodiments, an object mask includes a map of a digital image that indicates, for each pixel, whether the pixel corresponds to a portion of an object (or other semantic region). In some implementations, the indication includes a binary indication (e.g., a "1" for pixels belonging to the object and a "0" for pixels not belonging to the object). In an alternative implementation, the indication includes a probability (e.g., a number between 0 and 1) that indicates the likelihood that the pixel belongs to the object. In such an implementation, the closer the value is to 1, the more likely the pixel is to belong to the object, and vice versa.
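A small example of the two mask formats described above: a probability mask can be thresholded (here at 0.5, an assumed cutoff) into a binary mask indicating which pixels belong to the object.

```python
# A probability mask thresholded into a binary mask (0.5 is an assumed cutoff).
import numpy as np

probability_mask = np.array([[0.02, 0.10, 0.85],
                             [0.05, 0.93, 0.97],
                             [0.01, 0.40, 0.88]])
binary_mask = (probability_mask >= 0.5).astype(np.uint8)  # 1 = object pixel, 0 = not
print(binary_mask)
```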
In one or more embodiments, a machine learning model includes a computer representation that is tunable (e.g., trained) based on inputs to approximate an unknown function used to generate the corresponding outputs. Specifically, in some embodiments, a machine learning model includes a computer-implemented model that utilizes algorithms to learn from, and make predictions on, known data by analyzing the known data to learn to generate outputs that reflect patterns and attributes of the known data. For example, in some cases, the machine learning model includes, but is not limited to, a neural network (e.g., a convolutional neural network, a recurrent neural network, or another deep learning network), a decision tree (e.g., a gradient-boosted decision tree), association rule learning, inductive logic programming, support vector learning, a Bayesian network, a regression-based model (e.g., censored regression), principal component analysis, or a combination thereof.
In one or more embodiments, a neural network includes a model of interconnected artificial neurons (e.g., organized in layers) that communicate and learn to approximate complex functions and generate outputs based on a plurality of inputs provided to the model. In some cases, a neural network includes one or more machine learning algorithms. Furthermore, in some cases, a neural network includes an algorithm (or set of algorithms) that implements deep learning techniques that utilize a set of algorithms to model high-level abstractions in data. To illustrate, in some embodiments, a neural network includes a convolutional neural network, a recurrent neural network (e.g., a long short-term memory neural network), a generative adversarial neural network, a graph neural network, or a multi-layer perceptron. In some embodiments, a neural network includes a combination of neural networks or neural network components.
In one or more embodiments, a segmentation neural network includes a computer-implemented neural network that generates object masks for objects depicted in a digital image. In particular, in some embodiments, a segmentation neural network includes a computer-implemented neural network that detects objects within a digital image and generates object masks for the objects. Indeed, in some implementations, a segmentation neural network includes a neural network pipeline that analyzes a digital image, identifies one or more objects depicted in the digital image, and generates object masks for the one or more objects. However, in some cases, the segmentation neural network focuses on a subset of the tasks used to generate an object mask.
As described above, FIG. 3 illustrates one example of a segmentation neural network that the scene-based image editing system 106 uses in one or more implementations to generate object masks for objects depicted in a digital image. In particular, FIG. 3 illustrates one example of a segmentation neural network that the scene-based image editing system 106 uses in some embodiments to detect objects in a digital image and generate object masks for the objects. In effect, FIG. 3 shows a detection masking neural network 300 that includes an object detection machine learning model 308 (in the form of an object detection neural network) and an object segmentation machine learning model 310 (in the form of an object segmentation neural network). Specifically, the detection masking neural network 300 is an implementation of the on-device masking system described in U.S. patent application Ser. No. 17/589,114, filed on January 31, 2022, the entire contents of which are incorporated herein by reference.
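The detection masking neural network itself is described in the incorporated application; as a generic, hedged illustration of the detect-then-segment pattern shown in FIG. 3, a pipeline might be wired together as follows, where `detection_model.detect` and `segmentation_model.mask` are placeholder interfaces rather than the actual models 308 and 310.

```python
# Generic sketch of a detect-then-segment pipeline: a detection model proposes
# boxes, and a segmentation model produces a mask for each confident detection.
# Both model interfaces are placeholders, not the actual models 308 and 310.
def detect_and_mask(image, detection_model, segmentation_model, score_threshold=0.5):
    object_masks = {}
    for detection in detection_model.detect(image):
        if detection["score"] < score_threshold:
            continue                                    # skip low-confidence detections
        mask = segmentation_model.mask(image, detection["box"])
        object_masks[f"{detection['label']}_{len(object_masks)}"] = mask
    return object_masks
```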
Although FIG. 3 illustrates the scene-based image editing system 106 utilizing the detection masking neural network 300, in one or more implementations the scene-based image editing system 106 utilizes a different machine learning model to detect objects, generate object masks for objects, and/or extract objects from digital images. For example, in one or more implementations, the scene-based image editing system 106 utilizes one of the machine learning models or neural networks described in the following documents as (or as an alternative to) the segmented neural network: U.S. patent application Ser. No. 17/158,527, entitled "Segmenting Objects In Digital Images Utilizing A Multi-Object Segmentation Model Framework," filed January 26, 2021; or U.S. patent application Ser. No. 16/388,115, entitled "Robust Training of Large-Scale Object Detectors with Noisy Data," filed April 8, 2019; or U.S. patent application Ser. No. 16/518,880, entitled "Utilizing Multiple Object Segmentation Models To Automatically Select User-Requested Objects In Images," filed July 22, 2019; or U.S. patent application Ser. No. 16/817,418, entitled "Utilizing A Large-Scale Object Detector To Automatically Select Objects In Digital Images," filed March 20, 2020; or Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," NIPS, 2015; or Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," CVPR 2016, the contents of each of the foregoing applications and papers being incorporated herein by reference in their entirety.
Similarly, in one or more implementations, the scene-based image editing system 106 utilizes one of the machine learning models or neural networks described in the following documents as (or as an alternative to) the segmented neural network: Ning Xu et al., "Deep GrabCut for Object Selection," published July 14, 2017; or U.S. patent application publication No. 2019/013129, entitled "Deep Salient Content Neural Networks for Efficient Digital Object Segmentation," filed October 31, 2017; or U.S. patent application Ser. No. 16/035,410, entitled "Automatic Trimap Generation and Image Segmentation," filed July 13, 2018; or U.S. patent No. 10,192,129, entitled "Utilizing Interactive Deep Learning To Select Objects In Digital Visual Media," filed November 18, 2015, each of which is incorporated herein by reference in its entirety.
In one or more implementations, the segmented neural network is a panoptic segmentation neural network. In other words, the segmented neural network creates object masks for individual instances of a given object type. Further, in one or more implementations, the segmented neural network generates object masks for semantic regions (e.g., water, sky, sand, dirt, etc.) in addition to countable objects ("things"). Indeed, in one or more implementations, the scene-based image editing system 106 utilizes one of the machine learning models or neural networks described in the following documents as (or as an alternative to) the segmented neural network: U.S. patent application Ser. No. 17/495,618, entitled "PANOPTIC SEGMENTATION REFINEMENT NETWORK," filed October 2, 2021; or U.S. patent application Ser. No. 17/454,740, entitled "MULTI-SOURCE PANOPTIC FEATURE PYRAMID NETWORK," filed November 12, 2021, each of which is incorporated herein by reference in its entirety.
Returning now to FIG. 3, in one or more implementations, the scene-based image editing system 106 utilizes a detection masking neural network 300 that includes an encoder 302 (or neural network encoder) having a backbone network, a detection head (or neural network decoder head) 304, and a masking head (or neural network decoder head) 306. As shown in FIG. 3, the encoder 302 encodes the digital image 316 and provides the encoding to the detection head 304 and the masking head 306. The detection head 304 utilizes the encoding to detect one or more objects depicted in the digital image 316. The masking head 306 generates at least one object mask for the detected objects.
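To make the shared-encoder, two-head arrangement described above concrete, the following Python sketch shows one plausible way such a network could be wired up. It is a minimal, assumed architecture (layer widths, kernel sizes, and the per-location detection output are illustrative), not the patented detection masking neural network 300.

```python
import torch
import torch.nn as nn

class DetectionMaskingNetwork(nn.Module):
    """Shared encoder feeding a detection head and a masking head (sketch)."""

    def __init__(self, num_classes: int):
        super().__init__()
        # Backbone encoder: downsamples the image into a feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Detection head: per-location class scores plus 4 box offsets.
        self.detection_head = nn.Conv2d(64, num_classes + 4, 1)
        # Masking head: per-pixel object mask probabilities.
        self.masking_head = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1), nn.Sigmoid(),
        )

    def forward(self, image: torch.Tensor):
        features = self.encoder(image)            # shared encoding
        detections = self.detection_head(features)
        masks = self.masking_head(features)
        return detections, masks
```

A design point this sketch mirrors is that both heads consume the same encoding, so the image is only encoded once even when both detection and masking are requested.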
As described above, the detection masking neural network 300 utilizes both the object detection machine learning model 308 and the object segmentation machine learning model 310. In one or more implementations, the object detection machine learning model 308 includes both the encoder 302 and the detection head 304 shown in FIG. 3, while the object segmentation machine learning model 310 includes both the encoder 302 and the masking head 306. Further, the object detection machine learning model 308 and the object segmentation machine learning model 310 are separate machine learning models for processing objects within target and/or source digital images. FIG. 3 shows the encoder 302, the detection head 304, and the masking head 306 as a single model for detecting and segmenting objects of a digital image. For efficiency purposes, in some embodiments, the scene-based image editing system 106 uses the network shown in FIG. 3 as a single network. The aggregate network (i.e., the object detection machine learning model 308 and the object segmentation machine learning model 310) is referred to as the detection masking neural network 300. The following paragraphs describe components of the network related to the object detection machine learning model 308 (such as the detection head 304) and then transition to discuss components related to the object segmentation machine learning model 310.
As described above, in one or more embodiments, the scene-based image editing system 106 utilizes the object detection machine learning model 308 to detect and identify objects within the digital image 316 (e.g., a target or source digital image). FIG. 3 illustrates one implementation of an object detection machine learning model 308 used by the scene-based image editing system 106 in accordance with at least one embodiment. In particular, FIG. 3 illustrates the scene-based image editing system 106 utilizing the object detection machine learning model 308 to detect objects. In one or more embodiments, the object detection machine learning model 308 includes a deep learning convolutional neural network (CNN). For example, in some embodiments, the object detection machine learning model 308 includes a region-based convolutional neural network (R-CNN).
As shown in fig. 3, the object detection machine learning model 308 includes a lower neural network layer and an upper neural network layer. Typically, lower neural network layers collectively form an encoder 302, while higher neural network layers collectively form a detection head 304 (e.g., a decoder). In one or more embodiments, encoder 302 includes a convolutional layer that encodes the digital image into feature vectors that are output from encoder 302 and provided as inputs to detection head 304. In various implementations, the detection head 304 includes a fully connected layer that analyzes feature vectors and outputs detected objects (possibly with approximate boundaries around the objects).
In particular, in one or more implementations, the encoder 302 includes convolutional layers that generate feature vectors in the form of a feature map. To detect objects within the digital image 316, the object detection machine learning model 308 processes the feature map with a convolutional layer in the form of a small network that slides across small windows of the feature map. The object detection machine learning model 308 then maps each sliding window to a low-dimensional feature. In one or more embodiments, the object detection machine learning model 308 processes this feature using two separate detection heads implemented as fully connected layers. In some embodiments, the first head includes a box-regression layer that generates the detected objects, and the second head includes an object classification layer that generates object tags.
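The following Python sketch illustrates the sliding-window head pairing described above, in the style of a region-proposal detector: a small convolution slides over the feature map, and two sibling layers produce box offsets and object-tag scores. Channel counts, anchor counts, and class counts are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SlidingWindowDetectionHeads(nn.Module):
    """Box-regression and classification heads over a shared feature map (sketch)."""

    def __init__(self, feature_dim: int = 256, num_anchors: int = 9, num_classes: int = 80):
        super().__init__()
        # Small network slid across the feature map (3x3 window -> low-dimensional feature).
        self.window = nn.Conv2d(feature_dim, feature_dim, kernel_size=3, padding=1)
        self.box_regression = nn.Conv2d(feature_dim, num_anchors * 4, kernel_size=1)
        self.classification = nn.Conv2d(feature_dim, num_anchors * num_classes, kernel_size=1)

    def forward(self, feature_map: torch.Tensor):
        x = torch.relu(self.window(feature_map))
        boxes = self.box_regression(x)     # approximate boundaries (box offsets per anchor)
        tags = self.classification(x)      # object-tag scores per anchor
        return boxes, tags
```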
As shown in FIG. 3, the output from the detection head 304 displays an object tag above each detected object. For example, the detection masking neural network 300 assigns an object tag to each detected object in response to detecting the object. Specifically, in some embodiments, the detection masking neural network 300 utilizes object tags based on object classification. For illustration, FIG. 3 shows a tag 318 for the woman, a tag 320 for the bird, and a tag 322 for the man. Although not shown in FIG. 3, in some implementations, the detection masking neural network 300 further distinguishes between the woman and the surfboard held by the woman. Furthermore, the detection masking neural network 300 also optionally generates object masks for the illustrated semantic regions (e.g., sand, sea, and sky).
As described above, the object detection machine learning model 308 detects objects within the digital image 316. In some embodiments, and as shown in fig. 3, detection masking neural network 300 indicates the detected object with approximate boundaries (e.g., bounding boxes 319, 321, and 323). For example, each bounding box includes an area surrounding the object. In some embodiments, the detection masking neural network 300 annotates the bounding box with the previously mentioned object tags, such as the name of the detected object, coordinates of the bounding box, and/or the size of the bounding box.
As shown in FIG. 3, the object detection machine learning model 308 detects several objects in the digital image 316. In some cases, the detection masking neural network 300 identifies each detected object with a bounding box. In one or more embodiments, a bounding box includes an approximate boundary region that indicates the detected object. In some cases, an approximate boundary refers to an indication of a region that includes an object and that is larger and/or less precise than an object mask. In one or more embodiments, the approximate boundary includes at least a portion of the detected object and a portion of the digital image 316 that does not include the detected object. The approximate boundary can take various shapes, such as a square, rectangular, circular, oval, or other contour around the object. In one or more embodiments, the approximate boundary includes a bounding box.
Upon detecting an object in the digital image 316, the detection masking neural network 300 generates an object mask for the detected object. In general, rather than using a rough bounding box during object localization, the detection masking neural network 300 generates a segmentation mask that better defines the boundary of the object. The following paragraphs provide additional details regarding generating object masks for detected objects in accordance with one or more embodiments. In particular, FIG. 3 illustrates that the scene-based image editing system 106 utilizes an object segmentation machine learning model 310 to generate segmented objects via object masking, according to some embodiments.
As shown in fig. 3, the scene-based image editing system 106 processes the objects detected in the bounding box using the object segmentation machine learning model 310 to generate object masks, such as object mask 324 and object mask 326. In an alternative embodiment, the scene-based image editing system 106 utilizes the object detection machine learning model 308 itself to generate object masks for the detected objects (e.g., to segment the objects for selection).
In one or more implementations, prior to generating the object mask of the detected object, the scene-based image editing system 106 receives user input 312 to determine the object for which the object mask is to be generated. For example, the scene-based image editing system 106 receives input from a user indicating a selection of one of the detected objects. To illustrate, in the illustrated implementation, the scene-based image editing system 106 receives user input 312 that a user selects bounding box 321 and bounding box 323. In an alternative implementation, the scene-based image editing system 106 automatically generates object masks for each object (e.g., without a user request indicating the object to be selected).
As described above, the scene-based image editing system 106 processes the bounding boxes of the objects detected in the digital image 316 using the object segmentation machine learning model 310. In some embodiments, the bounding boxes comprise the output from the object detection machine learning model 308. For example, as shown in FIG. 3, each bounding box includes a rectangular border around an object. Specifically, FIG. 3 shows bounding boxes 319, 321, and 323 around the woman, the bird, and the man detected in the digital image 316.
In some embodiments, the scene-based image editing system 106 utilizes the object segmentation machine learning model 310 to generate object masks for the aforementioned detected objects within the bounding boxes. For example, the object segmentation machine learning model 310 corresponds to one or more deep neural networks or models that select an object based on bounding box parameters corresponding to the object within the digital image 316. Specifically, the object segmentation machine learning model 310 generates object masks 324 and 326 for the detected man and bird, respectively.
In some embodiments, the scene-based image editing system 106 selects the object segmentation machine learning model 310 based on object tags of objects identified by the object detection machine learning model 308. Generally, based on identifying one or more object categories associated with the input bounding box, the scene-based image editing system 106 selects an object segmentation machine learning model that is tuned to generate object masks for the identified one or more categories of objects. To illustrate, in some embodiments, based on determining that the one or more identified categories of objects include humans or people, the scene-based image editing system 106 utilizes a special human object masking neural network to generate an object mask, such as the object mask 324 shown in fig. 3.
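A small Python sketch of the model-selection step described above follows: dispatch to a masking model tuned for the detected object's category, falling back to a generic model otherwise. The registry contents and model names are hypothetical placeholders, not identifiers from the disclosure.

```python
# Hypothetical registry mapping object tags to specialized masking models.
SPECIALIZED_MASK_MODELS = {
    "person": "human_mask_network",     # e.g., a special human object masking network
    "default": "generic_mask_network",
}

def select_mask_model(object_tag: str) -> str:
    """Return the masking model tuned for the detected object category."""
    return SPECIALIZED_MASK_MODELS.get(object_tag, SPECIALIZED_MASK_MODELS["default"])

# Usage: select_mask_model("person") -> "human_mask_network"
```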
As further shown in FIG. 3, the scene-based image editing system 106 receives the object masks 324 and 326 as output from the object segmentation machine learning model 310. As previously described, in one or more embodiments, an object mask includes a pixel-level mask corresponding to an object in the source or target digital image. In one example, an object mask includes a segmentation boundary indicating a predicted edge of one or more objects and the pixels contained within the predicted edge.
In some embodiments, the scene-based image editing system 106 also detects the objects shown in the digital image 316 via the collective network, i.e., the detection masking neural network 300, in the same manner described above. For example, in some cases, the scene-based image editing system 106 detects the woman, the man, and the bird within the digital image 316 via the detection masking neural network 300. Specifically, the scene-based image editing system 106 utilizes feature pyramids and feature maps to identify objects within the digital image 316 via the detection head 304, and generates object masks via the masking head 306.
Further, in one or more implementations, although FIG. 3 illustrates generating object masks based on the user input 312, the scene-based image editing system 106 generates object masks without the user input 312. Specifically, the scene-based image editing system 106 generates object masks for all detected objects within the digital image 316. To illustrate, in at least one implementation, the scene-based image editing system 106 generates object masks for the woman, the man, and the bird despite not receiving the user input 312.
In one or more embodiments, the scene-based image editing system 106 implements object-aware image editing by generating a content fill for each object depicted in the digital image (e.g., for each object mask corresponding to a depicted object) using a hole-filling model. Specifically, in some cases, the scene-based image editing system 106 utilizes a machine learning model, such as a content-aware hole-filling machine learning model, to generate the content fill(s) for each foreground object. FIGS. 4-6 illustrate a content-aware hole-filling machine learning model used by the scene-based image editing system 106 to generate content fills for objects in accordance with one or more embodiments.
In one or more embodiments, a content fill includes a set of pixels generated to replace another set of pixels of a digital image. Indeed, in some embodiments, the content fill includes a set of replacement pixels for replacing another set of pixels. For example, in some embodiments, the content fill includes a set of pixels generated to fill a hole (e.g., a content blank) left after (or if) a set of pixels (e.g., a set of pixels depicting an object) has been removed from or moved within the digital image. In some cases, the content fill corresponds to the background of the digital image. To illustrate, in some implementations, the content fill includes a generated set of pixels that blends with the portion of the background near the object that may be moved/removed. In some cases, the content fill includes a repair segment, such as a repair segment generated from other pixels (e.g., other background pixels) within the digital image. In some cases, the content fill includes other content (e.g., arbitrarily selected content or content selected by a user) to fill a hole or replace another set of pixels.
In one or more embodiments, the content-aware hole-filling machine learning model includes a computer-implemented machine learning model that generates content fills. In particular, in some embodiments, the content-aware hole-filling machine learning model includes a computer-implemented machine learning model that generates content fills for replacement regions in a digital image. For example, in some cases, the scene-based image editing system 106 determines that an object has been moved within or removed from a digital image and utilizes the content-aware hole-filling machine learning model to generate a content fill for the hole exposed as a result of the movement/removal. However, as will be discussed in more detail, in some implementations, the scene-based image editing system 106 anticipates movement or removal of an object and utilizes the content-aware hole-filling machine learning model to pre-generate a content fill for the object. In some cases, the content-aware hole-filling machine learning model includes a neural network, such as a repair (inpainting) neural network (e.g., a neural network that uses other pixels of the digital image to generate content fills, more specifically repair segments). In other words, in various implementations, the scene-based image editing system 106 utilizes the content-aware hole-filling machine learning model to provide content at a location of the digital image that does not initially depict such content (e.g., because the location is occupied by another semantic region, such as an object).
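The Python sketch below illustrates the pre-generation idea described above: for every object mask, an inpainting model is asked for replacement pixels ahead of time, so a later move or delete can reveal an already-filled background. The `inpaint_fn` callable stands in for a content-aware hole-filling model and is an assumption of this sketch, not an API from the disclosure.

```python
import numpy as np

def pregenerate_content_fills(image: np.ndarray, object_masks: list, inpaint_fn):
    """Pre-generate a content fill for each object mask (illustrative sketch).

    `image` is an (H, W, 3) array, each mask is an (H, W) array of 0/1 values,
    and `inpaint_fn(image, mask)` is assumed to return an image of the same
    shape with the masked region filled from surrounding background pixels.
    """
    fills = []
    for mask in object_masks:
        filled = inpaint_fn(image, mask)                     # background-consistent pixels
        # Keep the generated pixels only inside the mask region.
        fills.append(np.where(mask[..., None] > 0, filled, image))
    return fills
```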
FIG. 4 illustrates a scene-based image editing system 106 generating a repair digital image 408 from a digital image 402 having a replacement region 404 using a content-aware machine learning model such as a cascade modulation repair neural network (cascaded modulation inpainting neural network) 420, in accordance with one or more embodiments.
In fact, in one or more embodiments, the replacement area 404 includes an area corresponding to an object (and the hole that would exist if the object were moved or deleted). In some embodiments, the scene-based image editing system 106 identifies the replacement area 404 based on a user's selection of pixels (e.g., pixels depicting an object) to be moved, removed, overlaid, or replaced within the digital image. To illustrate, in some cases, a client device selects an object depicted in the digital image. Thus, the scene-based image editing system 106 deletes or removes the object and generates replacement pixels. In some cases, the scene-based image editing system 106 identifies the replacement region 404 by generating an object mask via a segmented neural network. For example, the scene-based image editing system 106 utilizes a segmented neural network (e.g., the detection masking neural network 300 discussed above with reference to FIG. 3) to detect an object within the digital image and generate an object mask for the object. Thus, in some implementations, the scene-based image editing system 106 generates the content fill for the replacement area 404 before receiving user input to move, remove, overlay, or replace the pixels that originally occupied the replacement area 404.
As shown, the scene-based image editing system 106 utilizes the cascade modulation repair neural network 420 to generate replacement pixels for the replacement region 404. In one or more embodiments, the cascade modulation repair neural network 420 includes a generative adversarial neural network for generating replacement pixels. In some embodiments, a generative adversarial neural network (or "GAN") includes a neural network that is tuned or trained via an adversarial process to generate an output digital image (e.g., from an input digital image). In some cases, the generative adversarial neural network includes a plurality of constituent neural networks, such as an encoder neural network and one or more decoder/generator neural networks. For example, the encoder neural network extracts latent codes from a noise vector or a digital image. The generator neural network (or a combination of generator neural networks) generates a modified digital image by combining the extracted latent codes (e.g., from the encoder neural network). During training, the discriminator neural network competes with the generator neural network, analyzing generated digital images to produce an authenticity prediction by determining whether a generated digital image is authentic (e.g., from a set of stored digital images) or false (e.g., not from a set of stored digital images). Based on the discriminator neural network, the scene-based image editing system 106 also modifies parameters of the encoder neural network and/or the one or more generator neural networks to ultimately generate digital images that fool the discriminator neural network into indicating that a generated digital image is a true digital image.
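As a point of reference for the adversarial process described above, the following Python sketch shows one standard GAN training step. It is the generic recipe (binary cross-entropy on real-versus-fake predictions), not the patented training procedure; the generator and discriminator interfaces, including a discriminator that returns one logit per image, are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, real_images, noise):
    """One adversarial update (sketch). Assumes discriminator(x) -> (N, 1) logits."""
    n = real_images.size(0)

    # Discriminator update: push real images toward "real" (1), fakes toward "fake" (0).
    fake_images = generator(noise).detach()
    d_loss = (
        F.binary_cross_entropy_with_logits(discriminator(real_images), torch.ones(n, 1))
        + F.binary_cross_entropy_with_logits(discriminator(fake_images), torch.zeros(n, 1))
    )
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: push the discriminator toward labeling generated images as real.
    fake_images = generator(noise)
    g_loss = F.binary_cross_entropy_with_logits(
        discriminator(fake_images), torch.ones(noise.size(0), 1)
    )
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```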
Along these lines, a generative adversarial neural network refers to a neural network having a specific architecture or a specific purpose, such as a generative repair neural network. For example, a generative repair neural network includes a generative adversarial neural network that repairs or fills pixels of a digital image with a content fill (or generates a content fill in anticipation of repairing or filling pixels of the digital image). In some cases, the generative repair neural network repairs the digital image by filling in a hole region (indicated by an object mask). Indeed, as described above, in some embodiments, an object mask uses segmentation or masking to define a replacement region that indicates, overlaps, overlays, or delineates pixels to be removed or replaced within the digital image.
Thus, in some embodiments, the cascade modulation repair neural network 420 includes a generative repair neural network that utilizes a decoder having one or more cascaded modulation decoder layers. In effect, as shown in FIG. 4, the cascade modulation repair neural network 420 includes a plurality of cascaded modulation decoder layers 410, 412, 414, 416. In some cases, a cascaded modulation decoder layer includes at least two cascaded (e.g., connected one after the other) modulation blocks for modulating an input signal when generating the repair digital image. To illustrate, in some cases, the cascaded modulation decoder layer includes a first global modulation block and a second global modulation block. Similarly, in some cases, the cascaded modulation decoder layer includes a first global modulation block (which analyzes global features using a spatially invariant approach) and a second spatial modulation block (which analyzes local features using a spatially varying approach). Additional details regarding the modulation blocks are provided below (e.g., with respect to FIGS. 5-6).
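The Python sketch below gives a highly simplified picture of the global-then-spatial cascade just described: a spatially invariant scale/shift derived from a global code modulates the features first, and a spatially varying gate derived from that result modulates them second. It is an assumption-laden illustration of the general idea, not the architecture of FIGS. 5-6; layer names, shapes, and the specific fusion are invented for the sketch.

```python
import torch
import torch.nn as nn

class CascadedModulationLayerSketch(nn.Module):
    """Global modulation followed by spatial modulation (simplified sketch)."""

    def __init__(self, channels: int, global_dim: int):
        super().__init__()
        self.global_affine = nn.Linear(global_dim, channels * 2)      # scale/shift from global code
        self.global_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.spatial_gate = nn.Conv2d(channels, channels, 3, padding=1)
        self.spatial_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feature: torch.Tensor, global_code: torch.Tensor) -> torch.Tensor:
        # First block: spatially invariant modulation conditioned on the global code.
        scale, shift = self.global_affine(global_code).chunk(2, dim=-1)
        g = feature * (1 + scale[..., None, None]) + shift[..., None, None]
        g = torch.relu(self.global_conv(g))
        # Second block: spatially varying gate computed from the globally modulated features.
        gate = torch.sigmoid(self.spatial_gate(g))
        return torch.relu(self.spatial_conv(feature * gate)) + g
```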
As shown, the scene-based image editing system 106 utilizes the cascade modulation repair neural network 420 (and the cascaded modulation decoder layers 410, 412, 414, 416) to generate the repair digital image 408. In particular, the cascade modulation repair neural network 420 generates the repair digital image 408 by generating a content fill for the replacement region 404. As shown, the replacement area 404 is now filled with a content fill whose replacement pixels depict a photo-realistic scene in place of the replaced pixels.
As described above, the scene-based image editing system 106 generates a repair digital image using a cascade modulation repair neural network that includes a cascade modulation decoder layer. FIG. 5 illustrates an example architecture of a cascaded modulation repair neural network 502 in accordance with one or more embodiments.
As shown, the cascade modulation repair neural network 502 includes an encoder 504 and a decoder 506. In particular, encoder 504 includes a plurality of convolutional layers 508a-508n of different scales/resolutions. In some cases, the scene-based image editing system 106 feeds a digital image input 510 (e.g., encoding of a digital image) to the first convolution layer 508A to generate encoded feature vectors of a higher scale (e.g., lower resolution). The second convolution layer 508b processes the encoded feature vectors at a higher scale (lower resolution) and generates additional encoded feature vectors (at another higher scale/lower resolution). The cascade modulation repair neural network 502 iteratively generates these encoded feature vectors until a final/highest scale convolution layer 508n is reached and generates a final encoded feature vector representation of the digital image.
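The following Python sketch mirrors the multi-scale encoding just described: each stride-2 convolution halves the resolution and emits an encoded feature map, down to a coarsest scale. Channel widths, the number of scales, and the 4-channel input (image plus mask) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiScaleEncoderSketch(nn.Module):
    """Stack of stride-2 convolutions producing one feature map per scale (sketch)."""

    def __init__(self, in_channels: int = 4, base: int = 32, num_scales: int = 5):
        super().__init__()
        stages, channels = [], in_channels
        for i in range(num_scales):
            out_channels = base * (2 ** i)
            stages.append(nn.Sequential(
                nn.Conv2d(channels, out_channels, 3, stride=2, padding=1), nn.ReLU()))
            channels = out_channels
        self.stages = nn.ModuleList(stages)

    def forward(self, x: torch.Tensor):
        features = []
        for stage in self.stages:          # each stage halves the spatial resolution
            x = stage(x)
            features.append(x)
        return features                     # [scale 1 (finest), ..., scale L (coarsest)]
```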
As shown, in one or more embodiments, the cascade modulation repair neural network 502 generates a global signature (global feature code) from the final encoded feature vector of the encoder 504. Global feature codes include characteristic representations of digital images from a global (e.g., high-level, high-scale, low-resolution) perspective. In particular, the global feature code comprises a representation of a digital image reflecting the coded feature vector at the highest scale/lowest resolution (or the different coded feature vectors satisfying the threshold scale/resolution).
As shown, in one or more embodiments, the cascade modulation repair neural network 502 applies a neural network layer (e.g., a fully connected layer) to the final encoded feature vector to generate a pattern code 512 (e.g., a pattern vector). In addition, the cascade modulation repair neural network 502 generates the global feature code by combining the pattern code 512 with the random pattern code 514. Specifically, the cascade modulation repair neural network 502 generates the random pattern code 514 by processing the input noise vector with a neural network layer (e.g., a multi-layer perceptron). The neural network layer maps the input noise vector to a random pattern code 514. The cascade modulation repair neural network 502 combines (e.g., concatenates, adds, or multiplies) the random pattern code 514 with the pattern code 512 to generate a global feature code 516. Although fig. 5 illustrates a particular method of generating global feature codes 516, scene-based image editing system 106 can utilize a variety of different methods to generate global feature codes (e.g., without pattern codes 512 and/or random pattern codes 514) that represent encoded feature vectors of encoder 504.
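A short Python sketch of the combination step described above follows: a style code derived from the final encoded feature, a random style code produced by an MLP from noise, and their concatenation into a global code. Dimensions, and the assumption that the final encoded feature has already been pooled to a vector, are illustrative choices of the sketch.

```python
import torch
import torch.nn as nn

class GlobalCodeSketch(nn.Module):
    """Combine a style code and a random style code into a global feature code (sketch)."""

    def __init__(self, feat_dim: int = 512, style_dim: int = 512, noise_dim: int = 512):
        super().__init__()
        self.to_style = nn.Linear(feat_dim, style_dim)   # style code from final feature vector
        self.mapping = nn.Sequential(                    # MLP: noise z -> random style code w
            nn.Linear(noise_dim, style_dim), nn.ReLU(),
            nn.Linear(style_dim, style_dim),
        )

    def forward(self, final_feature: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        s = self.to_style(final_feature)
        s = s / (s.norm(dim=-1, keepdim=True) + 1e-8)    # normalize the style code
        w = self.mapping(z)                              # random style code
        return torch.cat([s, w], dim=-1)                 # global code g = [s; w]
```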
As described above, in some embodiments, the cascade modulation repair neural network 502 generates an image encoding using the encoder 504. An image encoding is an encoded representation of the digital image. Thus, in some cases, the image encoding includes one or more encoded feature vectors, a style code, and/or a global feature code.
In one or more embodiments, the cascade modulation repair neural network 502 utilizes a plurality of Fourier convolutional encoder layers to generate the image encoding (e.g., the encoded feature vectors, pattern code 512, and/or global feature code 516). For example, a Fourier convolutional encoder layer (or fast Fourier convolution) includes a convolutional layer that provides a non-local receptive field and cross-scale fusion within the convolutional unit. In particular, the fast Fourier convolution may include three computations in a single operation unit: a local branch that performs small-kernel convolution, a semi-global branch that processes spectrally stacked image patches, and a global branch that manipulates the image-level spectrum. These three branches complement one another by handling different scales. Furthermore, in some cases, the fast Fourier convolution includes a multi-branch aggregation process for cross-scale fusion. For example, in one or more embodiments, the cascade modulation repair neural network 502 uses a fast Fourier convolution layer as described in Lu Chi, Borui Jiang, and Yadong Mu, "Fast Fourier Convolution," Advances in Neural Information Processing Systems 33 (2020), which is incorporated herein by reference in its entirety.
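The Python sketch below shows a stripped-down fast-Fourier-convolution style layer with only a local branch and a global spectral branch; the semi-global branch and the full cross-scale aggregation of Chi et al. are omitted. Channel handling and the simple additive fusion are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class SimpleFFCSketch(nn.Module):
    """Local convolution plus a global branch on the image-level spectrum (sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.local_branch = nn.Conv2d(channels, channels, 3, padding=1)
        # The spectral branch convolves concatenated real/imaginary parts pointwise.
        self.spectral_conv = nn.Conv2d(channels * 2, channels * 2, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = self.local_branch(x)
        spectrum = torch.fft.rfft2(x, norm="ortho")            # image-level spectrum
        spec = torch.cat([spectrum.real, spectrum.imag], dim=1)
        spec = self.spectral_conv(spec)
        real, imag = spec.chunk(2, dim=1)
        global_out = torch.fft.irfft2(torch.complex(real, imag),
                                      s=x.shape[-2:], norm="ortho")
        return local + global_out                              # fuse local and global branches
```

The point of the global branch is that a single layer sees the entire image, giving the non-local receptive field mentioned above without stacking many ordinary convolutions.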
Specifically, in one or more embodiments, the cascade modulation repair neural network 502 uses a Fourier convolutional encoder layer for each of the encoder convolutional layers 508a-508n. Thus, the cascade modulation repair neural network 502 utilizes different Fourier convolutional encoder layers having different scales/resolutions to generate encoded feature vectors with improved, non-local receptive fields.
The operation of the encoder 504 may also be described in terms of variables and equations to demonstrate the function of the cascade modulation repair neural network 502. For example, as described above, the cascade modulation repair neural network 502 is an encoder-decoder network that uses the proposed cascaded modulation blocks for image repair in its decoding stage. Specifically, the cascade modulation repair neural network 502 starts with an encoder $E$ that takes the partial image and the mask as inputs to produce multi-scale feature maps from the input resolution down to resolution $4 \times 4$:

$$F_e^{(1)}, \ldots, F_e^{(L)} = E\big(x \odot (1 - m),\, m\big),$$

where $x$ is the input digital image, $m$ is the mask, and $F_e^{(i)}$ is the feature generated at scale $1 \le i \le L$ (with $L$ being the highest scale, i.e., the lowest resolution). The encoder is implemented by a set of stride-2 convolutions with residual connections.

After generating the highest-scale feature $F_e^{(L)}$, a fully connected layer followed by $\ell_2$ normalization produces a global style code $s = \mathrm{fc}\big(F_e^{(L)}\big) / \big\lVert \mathrm{fc}\big(F_e^{(L)}\big) \big\rVert_2$ that represents the global input. In parallel with the encoder, an MLP-based mapping network generates a random style code $w$ from normalized random Gaussian noise $z$, thereby simulating the randomness of the generation process. Furthermore, the scene-based image editing system 106 combines $w$ with $s$ to produce the final global code $g = [s; w]$ for decoding. As described above, in some embodiments, the scene-based image editing system 106 utilizes this final global code as the image encoding of the digital image.
Similarly, node cluster 1108a includes object attributes 1110a-1110d associated with node 1104a for the side table category, and additional object attributes 1112a-1112g associated with node 1104b for the table category. Thus, node cluster 1108a indicates that object attributes 1110a-1110d are specific to a side table category, while additional object attributes 1112a-1112g are more generally associated with a table category (e.g., associated with all object categories that fall within a table category). In one or more embodiments, the object attributes 1110a-1110d and/or the additional object attributes 1112a-1112g are attributes that have been arbitrarily assigned to their respective object categories (e.g., via user input or system defaults). For example, in some cases, the scene-based image editing system 106 determines that all side tables can support 100 pounds, as shown in fig. 11, regardless of the materials used or the quality of the build. However, in some cases, object attributes 1110a-1110d and/or additional object attributes 1112a-1112g represent object attributes that are common to all objects that fall within a particular category, such as a relatively small side table size. However, in some implementations, the object attributes 1110a-1110d and/or the additional object attributes 1112a-1112g are indicators of object attributes that should be determined for the objects of the respective object categories. For example, in one or more embodiments, upon identifying the side table, the scene-based image editing system 106 determines at least one of a capacity, a size, a weight, or a supporting weight of the side table.
It should be noted that in some embodiments there is some overlap between the object attributes included in the real-world category description graph and the property attributes included in the image analysis graph. Indeed, in many implementations, the object attributes are object-specific property attributes (rather than property attributes for the setting or scene of the digital image). Further, it should be noted that the object attributes shown are merely exemplary and do not necessarily reflect the object attributes to be associated with a given object category. Indeed, in some embodiments, the object attributes shown, and their association with particular object categories, are configurable to accommodate different needs for editing digital images.
In some cases, a node cluster corresponds to a particular object category and presents the category descriptions and/or hierarchy of object components for that particular category. For example, in some implementations, node cluster 1108a corresponds only to the side table category and presents the category descriptions and/or object components related to side tables. Thus, in some cases, upon identifying a side table within a digital image, the scene-based image editing system 106 references the node cluster 1108a for the side table category when generating a semantic scene graph, but references another node cluster upon identifying another sub-category of table within the digital image. In some cases, that other node cluster shares several similarities (e.g., similar nodes and edges) with node cluster 1108a, as the other type of table is also included in a sub-category of the table category and includes one or more legs.
However, in some implementations, a node cluster corresponds to a plurality of different but related object categories and presents a common hierarchy of category descriptions and/or object components for those object categories. For example, in some embodiments, node cluster 1108a includes an additional node representing another table sub-category that is connected to node 1104b (representing the table category) via an edge indicating that it, too, is a sub-category of the table category. Indeed, in some cases, node cluster 1108a includes nodes representing various sub-categories of the table category. Thus, in some cases, upon identifying a table from a digital image, the scene-based image editing system 106 references the node cluster 1108a when generating a semantic scene graph for the digital image, regardless of the sub-category to which the table belongs.
As will be described, in some implementations, object interactions within a digital image are facilitated using a common node cluster for multiple related subcategories. For example, as described above, FIG. 11 illustrates a plurality of individual node clusters. However, as further mentioned, in some cases, the scene-based image editing system 106 includes a classification (e.g., entity classification) that is common between all of the represented objects within the real-world category description map 1102. Thus, in some implementations, the real-world class description map 1102 includes a single network of interconnected nodes, where all clusters of nodes corresponding to individual object classes are connected at a common node, such as a node representing an entity class. Thus, in some embodiments, the real world category description graph 1102 shows the relationships between all represented objects.
In one or more embodiments, the scene-based image editing system 106 generates a real-world class description map, such as the real-world class description map 1102 of FIG. 11, for use in generating semantic scene graphs of digital images. For example, in some cases, the scene-based image editing system 106 generates the real-world class description map by generating hierarchies of object classifications for objects that may be represented in a digital image. Thus, in some cases, when generating a semantic scene graph of a digital image, the scene-based image editing system 106 determines the object classes of the objects depicted therein and associates the object classification hierarchies corresponding to those object classes with the objects based on the real-world class description map. In some implementations, to generate a hierarchy of object classifications in the real-world class description map, the scene-based image editing system 106 generates a node representing an object class, generates one or more nodes representing sub-classes of the object class, and generates edges connecting the nodes. In some cases, the scene-based image editing system 106 generates nodes representing sub-classes of those sub-classes and/or nodes higher than the object class within the hierarchy (e.g., nodes representing object classes of which it is a sub-class).
In some implementations, the scene-based image editing system 106 also generates a real-world class description map by generating a representation of a profile (anatomy) of objects that may be depicted in the digital image. For example, in some cases, scene-based image editing system 106 generates nodes representing components of an object class (e.g., components that include components of objects included in the object class, such as table legs that are components of a table). In some cases, scene-based image editing system 106 generates edges that connect the nodes representing the components to nodes representing the respective object classes.
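The following Python sketch illustrates, under assumed names, how such a class description entry could be built: nodes for an object class, its parent classes, and its components, linked by "is a subclass of" and "is a part of" edges. The `networkx` dependency, the helper, and the example hierarchy (side table, table, furniture, entity; leg, tabletop) are illustrative assumptions, not the patented graph format.

```python
import networkx as nx   # assumed dependency for this sketch

def build_class_description(graph: nx.DiGraph, object_class: str,
                            parent_classes: list, components: list) -> None:
    """Add one object class, its classification hierarchy, and its components."""
    graph.add_node(object_class, kind="object class")
    previous = object_class
    for parent in parent_classes:                 # e.g. ["table", "furniture", "entity"]
        graph.add_node(parent, kind="object class")
        graph.add_edge(previous, parent, relation="is a subclass of")
        previous = parent
    for component in components:                  # e.g. ["leg", "tabletop"]
        graph.add_node(component, kind="component")
        graph.add_edge(component, object_class, relation="is a part of")

# Usage with hypothetical categories:
g = nx.DiGraph()
build_class_description(g, "side table", ["table", "furniture", "entity"], ["leg", "tabletop"])
```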
In one or more embodiments, the scene-based image editing system 106 utilizes behavior policies in generating a semantic scene graph of a digital image. FIG. 12 illustrates a behavior policy diagram 1202 used by the scene-based image editing system 106 in generating a semantic scene graph in accordance with one or more embodiments.
In one or more embodiments, the behavior policy map includes a template map that describes behavior of objects depicted in the digital image based on a context in which the objects are depicted. In particular, in some embodiments, the behavior policy map includes a template map that assigns behaviors to objects depicted in the digital image based on semantic understanding of the objects depicted in the digital image and/or their relationships to other objects depicted in the digital image. Indeed, in one or more embodiments, the behavior policy map includes various relationships between various types of objects (e.g., object classes) and specifies behaviors for those relationships. Thus, in some embodiments, the behavior policy graph assigns behaviors to object classes based on object relationships. In some cases, scene-based image editing system 106 includes a behavior policy map as part of a semantic scene graph. In some implementations, as will be discussed further below, behavior policies are separated from the semantic scene graph, but plug-in behavior is provided based on semantic understanding and relationships of objects represented in the semantic scene graph.
As shown in FIG. 12, behavior policy graph 1202 includes a plurality of relationship indicators 1204a-1204e and a plurality of behavior indicators 1206a-1206e associated with the relationship indicators 1204a-1204e. In one or more embodiments, the relationship indicators 1204a-1204e reference a relationship subject (e.g., an object in a digital image that is the subject of the relationship) and a relationship object (e.g., an object in a digital image that is the object of the relationship). For example, the relationship indicators 1204a-1204e of FIG. 12 indicate that the relationship subject is "supported by" or is "a part of" the relationship object. Further, in one or more embodiments, the behavior indicators 1206a-1206e assign behaviors to the relationship subject (e.g., indicate that the relationship subject "moves together" or "deletes together" with the relationship object). In other words, the behavior indicators 1206a-1206e provide modification instructions for the relationship subject when the relationship object is modified.
FIG. 12 provides a small subset of the relationships identified by the scene-based image editing system 106 in various embodiments. For example, in some implementations, relationships identified by the scene-based image editing system 106 and incorporated into the generated semantic scene graph include, but are not limited to, relationships described as "above," "below," "back," "front," "contact," "held," "holding," "supporting," "standing on," "worn," "wearing," "resting on," "being viewed," or "looking at." Indeed, as described above, in some implementations, the scene-based image editing system 106 utilizes pairs of relationships to describe relationships between objects in both directions. For example, in some cases, where a first object is described as being "supported" by a second object, the scene-based image editing system 106 also describes the second object as "supporting" the first object. Thus, in some cases, behavior policy graph 1202 includes these relationship pairs, and the scene-based image editing system 106 includes the information in the semantic scene graph accordingly.
As further shown, the behavior policy graph 1202 also includes a plurality of classification indicators 1208a-1208e associated with the relationship indicators 1204a-1204e. In one or more embodiments, the classification indicators 1208a-1208e indicate the object classes to which the assigned behaviors apply. Indeed, in one or more embodiments, the classification indicators 1208a-1208e reference the object class of the corresponding object. As shown in FIG. 12, the classification indicators 1208a-1208e indicate that a behavior is assigned to object classes that are sub-classes of the specified object class. In other words, FIG. 12 shows that the classification indicators 1208a-1208e reference a particular object class and indicate that the assigned behavior applies to all objects that fall within that object class (e.g., objects whose class is part of a sub-class that falls within the object class).
The level of generality or specificity of the object class referenced by a classification indicator within its corresponding object classification hierarchy varies in various embodiments. For example, in some embodiments, the classification indicator references the lowest classification level (e.g., the most specific applicable classification) such that there are no sub-classes and the corresponding behavior applies only to objects having that particular lowest classification level. On the other hand, in some implementations, the classification indicator references the highest classification level (e.g., the most general applicable classification) or some other level above the lowest classification level, such that the corresponding behavior applies to objects associated with one or more of the plurality of classification levels that exist within the specified classification level.
To provide a description of how behavior policy graph 1202 indicates assigned behavior, relationship indicator 1204a indicates a "supporting" relationship between an object (e.g., a relationship subject) and another object (e.g., a relationship object). The behavior indicator 1206a indicates a "move together" behavior associated with a "support" relationship, and the classification indicator 1208a indicates that the particular behavior applies to objects within a specified object class. Thus, in one or more embodiments, behavior policy diagram 1202 shows that an object that falls into a specified object class and has a "supporting" relationship with another object will exhibit "move together" behavior. In other words, if a first object specifying an object class is depicted in a digital image supported by a second object and the digital image is modified to move the second object, then the scene-based image editing system 106 automatically moves the first object with the second object as part of the modification according to the behavior policy map 1202. In some cases, rather than automatically moving the first object, the scene-based image editing system 106 provides suggestions for moving the first object for display within a graphical user interface for modifying the digital image.
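The Python sketch below walks through that logic under assumed data structures: given the object being moved, find every relationship subject whose relationship and object class match a policy entry assigned the "move together" behavior. The dictionary-based policy, the relationship triples, and the class-hierarchy lookup are illustrative stand-ins for the behavior policy graph, not its patented format.

```python
# Hypothetical policy: (relationship, specified object class) -> assigned behavior.
BEHAVIOR_POLICY = {
    ("supported by", "object"): "moves together",
    ("part of", "object"): "deletes together",
}

def objects_to_move_with(target, relationships, class_hierarchy):
    """Return relationship subjects that should move together with `target`.

    `relationships` is a list of (subject, relation, object) triples, and
    `class_hierarchy[subject]` is the list of classes the subject falls under.
    """
    moved = []
    for subject, relation, related_object in relationships:
        if related_object != target:
            continue
        for (rel, category), behavior in BEHAVIOR_POLICY.items():
            if (relation == rel
                    and category in class_hierarchy.get(subject, [])
                    and behavior == "moves together"):
                moved.append(subject)
    return moved

# Usage: moving the table also moves the vase that it supports.
relationships = [("vase", "supported by", "table")]
hierarchy = {"vase": ["vase", "container", "object"]}
print(objects_to_move_with("table", relationships, hierarchy))   # ["vase"]
```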
As shown in FIG. 12, some relationship indicators (e.g., relationship indicators 1204a-1204b or relationship indicators 1204c-1204e) refer to the same relationship but are associated with different behaviors. Indeed, in some implementations, behavior policy graph 1202 assigns multiple behaviors to the same relationship. In some cases, the differences are due to differences in the specified sub-classes. In particular, in some embodiments, the scene-based image editing system 106 assigns objects of one object class a particular behavior for a particular relationship, but assigns objects of another object class a different behavior for the same relationship. Thus, in configuring the behavior policy graph 1202, the scene-based image editing system 106 manages different object categories differently in various embodiments.
In one or more embodiments, the scene-based image editing system 106 generates a behavior policy map, such as behavior policy map 1202 of FIG. 12, for generating a semantic scene map of a digital image. For example, in some cases, scene-based image editing system 106 generates a relationship indicator corresponding to the object relationship, a classification indicator corresponding to the object class, and a behavior indicator corresponding to the behavior. The scene-based image editing system 106 associates various indicators in the behavior policy map. In particular, the scene-based image editing system 106 associates various indicators to represent relationships of object classes and behaviors assigned to object classes based on these relationships. To illustrate, in some cases, the scene-based image editing system 106 associates each object class with at least one behavior based on a relationship between the object class and some other object class (e.g., when the relationship is specified but the other object class is not specified).
As shown in FIG. 12, in some cases, scene-based image editing system 106 assigns multiple behaviors to an object class based on different relationships associated with the object class. Further, in some instances, scene-based image editing system 106 assigns multiple behaviors to an object class based on a single relationship associated with the object class. Thus, in various embodiments, the number and types of behaviors assigned to an object class based on its associated object relationships are different.
In some cases, the scene-based image editing system 106 generates different behavior policy maps for different editing contexts. Indeed, in some embodiments, the scene-based image editing system 106 generates different behavior policy maps that assign different sets of behaviors to object classes based on their object relationships. For example, in some implementations, the scene-based image editing system 106 generates different behavior policy maps for use by different client devices or for use in different editing contexts on a particular client device. To illustrate, in some cases, the scene-based image editing system 106 generates a first behavior policy map for a first set of user preferences and generates a second behavior policy map for a second set of user preferences. Thus, even when editing is performed on the same client device, the scene-based image editing system 106 determines which behavior policy map to use based on the active set of user preferences. Thus, in some cases, the scene-based image editing system 106 generates behavior policy maps in response to user input that establishes a corresponding set of user preferences. As another example, in some implementations, the scene-based image editing system 106 generates a first behavior policy map for use by a first editing application and generates a second behavior policy map for use by a second editing application. Thus, the scene-based image editing system 106 may associate a behavior policy map with a particular editing context (e.g., a particular set of user preferences, a particular editing application, or a particular client device) and invoke the behavior policy map when its corresponding editing context applies.
While much of this discussion is provided in the context of generating a semantic scene graph of a digital image, in some cases the scene-based image editing system 106 utilizes the behavior policy map itself (e.g., without using a semantic scene graph) when modifying a digital image. For example, in some embodiments, the scene-based image editing system 106 generates a behavior policy map, receives a digital image, determines behaviors of objects depicted in the digital image using the behavior policy map, and modifies one or more objects within the digital image based on those behaviors. In particular, the scene-based image editing system 106 modifies an object based on its relationship with another object that is targeted for modification (and the behavior associated with that relationship).
FIG. 13 illustrates a semantic scene graph 1302 generated by a scene-based image editing system 106 for a digital image in accordance with one or more embodiments. In particular, the semantic scene graph 1302 shown in FIG. 13 is a simplified example of a semantic scene graph and in various embodiments does not depict all of the information included in the semantic scene graph generated by the scene-based image editing system 106.
As shown in fig. 13, a semantic scene graph 1302 is organized according to the image analysis graph 1000 described above with reference to fig. 10. In particular, semantic scene graph 1302 includes a single network of interconnected nodes that reference characteristics of a digital image. For example, semantic scene graph 1302 includes nodes 1304a-1304c representing depicted objects as indicated by their connection to node 1306. In addition, semantic scene graph 1302 includes relationship indicators 1308a-1308c that represent relationships between objects corresponding to nodes 1304a-1304 c. As further shown, the semantic scene graph 1302 includes nodes 1310 that represent commonalities between objects (e.g., where the objects are all included in the digital image, or where the objects indicate a subject or topic of the digital image). Further, as illustrated, semantic scene graph 1302 includes property attributes 1314a-1314f for objects corresponding to nodes 1304a-1304 c.
As further shown in FIG. 13, the semantic scene graph 1302 includes context information from the real-world category description graph 1102 described above with reference to FIG. 11. In particular, the semantic scene graph 1302 includes nodes 1312a-1312c that indicate the object categories to which the objects corresponding to nodes 1304a-1304c belong. Although not shown in FIG. 13, the semantic scene graph 1302 also includes the full hierarchy of object classifications for each object category represented by nodes 1312a-1312c. However, in some cases, each of the nodes 1312a-1312c instead includes a pointer to its respective object classification hierarchy within the real-world category description graph 1102. Further, as shown in FIG. 13, the semantic scene graph 1302 includes object attributes 1316a-1316e for the object categories represented therein.
Further, as shown in fig. 13, semantic scene graph 1302 includes behaviors from behavior policy graph 1202 described above with reference to fig. 12. In particular, semantic scene graph 1302 includes behavior indicators 1318a-1318b, where behavior indicators 1318a-1318b indicate behavior of objects represented therein based on their associations.
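To summarize how those ingredients come together for a single depicted object, the following Python sketch assembles one hypothetical semantic scene graph entry combining property attributes, the object classification hierarchy, relationships, and assigned behaviors. The schema, field names, and example values are assumptions for illustration, not the structure of semantic scene graph 1302.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraphNode:
    """One object entry in an illustrative semantic scene graph."""
    object_id: str
    category: str
    class_hierarchy: list                                 # e.g. ["side table", "table", "entity"]
    attributes: dict = field(default_factory=dict)        # property/object attributes
    relationships: list = field(default_factory=list)     # (relation, other object id)
    behaviors: list = field(default_factory=list)         # behaviors from the behavior policy

# Hypothetical example entry:
node = SceneGraphNode(
    object_id="object_1",
    category="side table",
    class_hierarchy=["side table", "table", "furniture", "entity"],
    attributes={"material": "wood", "average_depth": 3.2},
    relationships=[("supported by", "floor")],
    behaviors=["moves together"],
)
```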
FIG. 14 shows an illustration for generating a semantic scene graph for a digital image using a template graph in accordance with one or more embodiments. In effect, as shown in FIG. 14, the scene-based image editing system 106 utilizes one or more neural networks 1404 to analyze the digital image 1402. In particular, in one or more embodiments, the scene-based image editing system 106 utilizes one or more neural networks 1404 to determine various characteristics of the digital image 1402 and/or corresponding characteristic attributes thereof. For example, in some cases, scene-based image editing system 106 utilizes a segmented neural network to identify and classify objects depicted in a digital image (as discussed above with reference to fig. 3). Furthermore, in some embodiments, scene-based image editing system 106 utilizes a neural network to determine relationships between objects and/or their object properties, as will be discussed in more detail below.
In one or more implementations, the scene-based image editing system 106 utilizes a depth estimation neural network to estimate the depths of objects in the digital image and stores the determined depths in the semantic scene graph 1412. For example, the scene-based image editing system 106 utilizes a depth estimation neural network as described in U.S. application Ser. No. 17/186,436, entitled "GENERATING DEPTH IMAGES UTILIZING A MACHINE-LEARNING MODEL BUILT FROM MIXED DIGITAL IMAGE SOURCES AND MULTIPLE LOSS FUNCTION SETS," filed on month 26 of 2021, which is incorporated herein by reference in its entirety. Alternatively, the scene-based image editing system 106 utilizes a depth refinement neural network as described in U.S. application Ser. No. 17/658,873, entitled "UTILIZING MACHINE LEARNING MODELS TO GENERATE REFINED DEPTH MAPS WITH SEGMENTATION MASK GUIDANCE," filed on day 4, month 12 of 2022, which is incorporated herein by reference in its entirety. Then, when editing an object to perform a realistic scene edit, the scene-based image editing system 106 accesses the depth information for the object (e.g., the average depth of the object) from the semantic scene graph 1412. For example, as an object is moved within the image, the scene-based image editing system 106 accesses the depth information for the objects in the digital image from the semantic scene graph 1412 to ensure that the moved object is not placed in front of objects having a smaller depth.
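A minimal Python sketch of that depth check follows: before compositing a moved object at a new location, compare its stored average depth against the objects it would overlap, so it is never drawn in front of an object that is closer to the camera. The scene-graph dictionary layout and helper name are hypothetical.

```python
def can_place_in_front(moved_object_id: str, overlapped_ids: list, scene_graph: dict) -> bool:
    """Return True if the moved object may be composited in front of the overlapped objects."""
    moved_depth = scene_graph[moved_object_id]["average_depth"]
    # Smaller depth means closer to the camera; the moved object may only be
    # drawn in front of objects that are at least as deep as it is.
    return all(scene_graph[oid]["average_depth"] >= moved_depth for oid in overlapped_ids)

# Usage with a hypothetical scene graph:
scene_graph = {"person": {"average_depth": 2.0}, "tree": {"average_depth": 8.5}}
print(can_place_in_front("person", ["tree"], scene_graph))   # True: tree is farther away
```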
In one or more implementations, the scene-based image editing system 106 utilizes a lighting estimation neural network to estimate illumination parameters for objects or the scene in the digital image and stores the determined illumination parameters in the semantic scene graph 1412. For example, the scene-based image editing system 106 utilizes a source-specific lighting estimation neural network as described in U.S. application Ser. No. 16/558,975, entitled "DYNAMICALLY ESTIMATING LIGHT-SOURCE-SPECIFIC PARAMETERS FOR DIGITAL IMAGES USING A NEURAL NETWORK," filed on 3, 9, 2019, which is incorporated herein by reference in its entirety. Then, when editing an object to perform a realistic scene edit, the scene-based image editing system 106 accesses the illumination parameters for the object or scene from the semantic scene graph 1412. For example, when moving an object within the image or inserting a new object into the digital image, the scene-based image editing system 106 accesses the illumination parameters from the semantic scene graph 1412 to ensure that the object moved/placed within the digital image has realistic illumination.
As further shown in FIG. 14, the scene-based image editing system 106 utilizes the output of the one or more neural networks 1404 along with the image analysis map 1406, the real-world category description map 1408, and the behavior policy map 1410 to generate the semantic scene graph 1412. In particular, the scene-based image editing system 106 generates the semantic scene graph 1412 to include a description of the digital image 1402 in accordance with the structure, the characteristic attributes, the hierarchy of object classifications, and the behaviors provided by the image analysis map 1406, the real-world category description map 1408, and the behavior policy map 1410.
As previously described, in one or more embodiments, the image analysis map 1406, the real world category description map 1408, and/or the behavior policy map 1410 are predetermined or pre-generated. In other words, the scene-based image editing system 106 pre-generates, constructs, or otherwise determines the content and organization of each graph prior to implementation. For example, in some cases, scene-based image editing system 106 generates image analysis map 1406, real-world category description map 1408, and/or behavior policy map 1410 based on user input.
Further, in one or more embodiments, the image analysis map 1406, the real world category description map 1408, and/or the behavior policy map 1410 are configurable. In practice, the data represented therein may be reconfigured, reorganized, and/or added or removed based on preferences or the need to edit the digital image. For example, in some cases, the behavior assigned by behavior policy map 1410 works in some image editing contexts, but not in other image editing contexts. Thus, when editing an image in another image editing context, the scene-based image editing system 106 implements one or more neural networks 1404 and image analysis graphs 1406, but implements different behavior policy graphs (e.g., behavior policy graphs configured to meet the preferences of the image editing context). Thus, in some embodiments, the scene-based image editing system 106 modifies the image analysis map 1406, the real-world category description map 1408, and/or the behavior policy map 1410 to accommodate different image editing contexts.
For example, in one or more implementations, the scene-based image editing system 106 determines the context for selecting a behavior policy map by identifying the user type. In particular, the scene-based image editing system 106 generates a plurality of behavior policy maps for various types of users. For example, the scene-based image editing system 106 generates a first behavior policy map for novice or new users. In one or more implementations, the first behavior policy map includes a greater number of behavior policies than a second behavior policy map. In particular, for newer users, the scene-based image editing system 106 utilizes the first behavior policy map, which provides greater behavior automation and less control to the user. On the other hand, for advanced users, the scene-based image editing system 106 uses the second behavior policy map, which has fewer behavior policies than the first behavior policy map. In this way, the scene-based image editing system 106 provides advanced users with greater control over the relationship-based behaviors (automatic move/delete/edit) applied to related objects. In other words, by utilizing the second behavior policy map for advanced users, the scene-based image editing system 106 performs less automatic editing of related objects.
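As a rough illustration of selecting a behavior policy map by user type, the sketch below uses purely hypothetical policy contents and names; it only shows how a smaller policy set gives an advanced user more manual control:

```python
# Hypothetical behavior policy maps: each maps a relationship to the
# behaviors applied automatically to related objects during an edit.
NOVICE_POLICY = {
    "holding":      ["moves-with", "deletes-with"],
    "wearing":      ["moves-with", "deletes-with"],
    "supported-by": ["moves-with"],
}
ADVANCED_POLICY = {
    "wearing": ["moves-with"],  # fewer automatic behaviors -> more manual control
}

def select_policy(user_type: str) -> dict:
    return NOVICE_POLICY if user_type == "novice" else ADVANCED_POLICY

print(select_policy("novice")["holding"])            # ['moves-with', 'deletes-with']
print(select_policy("advanced").get("holding", []))  # []
```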
In one or more implementations, the scene-based image editing system 106 determines a context for selecting a behavior policy map based on visual content of the digital image (e.g., a type of object depicted in the digital image), an editing application being used, and the like. Thus, in one or more implementations, the scene-based image editing system 106 selects/uses the behavior policy map based on the image content, the user type, the editing application being used, or another context.
Furthermore, in some embodiments, the scene-based image editing system 106 utilizes these maps when analyzing a plurality of digital images. Indeed, in some cases, the image analysis map 1406, the real-world category description map 1408, and/or the behavior policy map 1410 are not specifically targeted to a particular digital image. Thus, in many cases, these maps are generic and are reused by the scene-based image editing system 106 across multiple instances of digital image analysis.
In some cases, the scene-based image editing system 106 also implements one or more mappings to map between the outputs of the one or more neural networks 1404 and the data schemas of the image analysis map 1406, the real-world category description map 1408, and/or the behavior policy map 1410. As one example, in various embodiments, the scene-based image editing system 106 utilizes various segmentation neural networks to identify and classify objects. Thus, depending on the segmentation neural network used, the resulting classification of a given object may differ (e.g., different terms (words) or different levels of abstraction). Thus, in some cases, the scene-based image editing system 106 utilizes a mapping that maps the particular outputs of a segmentation neural network to the object categories represented in the real-world category description map 1408, allowing the real-world category description map 1408 to be used in conjunction with multiple neural networks.
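A minimal, hypothetical sketch of such a mapping between network-specific labels and the object classes of the real-world category description map might look like this (all label names are illustrative):

```python
# Hypothetical mapping from the labels emitted by two different segmentation
# networks onto the object classes used by the real-world category description map.
NETWORK_A_TO_CLASS = {"person": "human", "man": "human", "puppy": "dog"}
NETWORK_B_TO_CLASS = {"pedestrian": "human", "canine": "dog"}

def to_description_map_class(label: str, network: str) -> str:
    table = NETWORK_A_TO_CLASS if network == "A" else NETWORK_B_TO_CLASS
    # Fall back to the raw label when no mapping entry exists.
    return table.get(label, label)

print(to_description_map_class("puppy", "A"))       # dog
print(to_description_map_class("pedestrian", "B"))  # human
```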
FIG. 15 shows another illustration of generating a semantic scene graph for a digital image in accordance with one or more embodiments. In particular, FIG. 15 illustrates an example framework used by the scene-based image editing system 106 to generate a semantic scene graph in accordance with one or more embodiments.
As shown in fig. 15, scene-based image editing system 106 identifies input image 1500. In some cases, scene-based image editing system 106 identifies input image 1500 based on a request. For example, in some cases, the request includes a request to generate a semantic scene graph for the input image 1500. In one or more implementations, the request to analyze the input image includes the scene-based image editing system 106 accessing, opening, or displaying the input image 1500.
In one or more embodiments, in response to the request, scene-based image editing system 106 generates object suggestions and sub-graph suggestions for input image 1500. For example, in some embodiments, scene-based image editing system 106 utilizes object suggestion network (object proposal network) 1520 to extract a set of object suggestions for input image 1500. To illustrate, in some cases, the scene-based image editing system 106 extracts a set of object suggestions for humans detected within the input image 1500, object(s) worn by humans, object(s) near humans, building, plant, animal, background object or scene (including sky or objects in the sky), and so forth.
In one or more embodiments, the object suggestion network 1520 includes the detection masking neural network 300 (specifically, the object detection machine learning model 308) discussed above with reference to FIG. 3. In some cases, the object suggestion network 1520 includes a neural network, such as a region proposal network ("RPN") that is part of a region-based convolutional neural network, to extract a set of object suggestions represented by a plurality of bounding boxes. One example RPN is described in S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," NIPS, 2015, the entire contents of which are incorporated herein by reference. As an example, in some cases, the scene-based image editing system 106 uses the RPN to extract object suggestions for important objects (e.g., detectable objects or objects having a threshold size/visibility) within the input image. The following function represents one embodiment of generating the set of object suggestions:
$$[o_0, \ldots, o_{N-1}] = f_{RPN}(I)$$

where $I$ is the input image, $f_{RPN}(\cdot)$ represents the RPN network, and $o_i$ is the $i$-th object suggestion.
In some implementations, in conjunction with determining the object suggestions, the scene-based image editing system 106 also determines the coordinates of each object suggestion relative to the size of the input image 1500. Specifically, in some cases, the location of an object suggestion is based on a bounding box containing the visible portion(s) of the object within the digital image. To illustrate, for $o_i$, the coordinates of the corresponding bounding box are represented by $r_i = [x_i, y_i, w_i, h_i]$, where $(x_i, y_i)$ are the coordinates of the upper-left corner and $w_i$ and $h_i$ are the width and height of the bounding box, respectively. Thus, the scene-based image editing system 106 determines the relative position of each important object or entity in the input image 1500 and stores the position data along with the set of object suggestions.
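The following short Python sketch (illustrative names only) shows one way an object suggestion and its bounding-box coordinates could be recorded relative to the image size, in the spirit of the description above:

```python
from dataclasses import dataclass

@dataclass
class ObjectSuggestion:
    label: str
    confidence: float
    bbox: tuple  # (x, y, w, h) with (x, y) the upper-left corner

def normalized_bbox(bbox, image_width, image_height):
    """Express the bounding box relative to the image size, so the stored
    position data is independent of the image resolution."""
    x, y, w, h = bbox
    return (x / image_width, y / image_height, w / image_width, h / image_height)

person = ObjectSuggestion("person", 0.93, bbox=(40, 60, 120, 300))
print(normalized_bbox(person.bbox, image_width=640, image_height=480))
```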
As described above, in some implementations, the scene-based image editing system 106 also determines sub-graph suggestions for the object suggestions. In one or more embodiments, a sub-graph suggestion indicates a relationship involving particular object suggestions in the input image 1500. It will be appreciated that any two different objects $(o_i, o_j)$ in a digital image may correspond to two possible relationships in opposite directions. As an example, a first object may be "above" a second object, and the second object may be "below" the first object. Since each pair of objects has two possible relationships, the total number of possible relationships for N object suggestions is N(N-1). Thus, in a system that attempts to determine all possible relationships in both directions for every object suggestion of an input image, more object suggestions result in a larger scene graph than fewer object suggestions, while increasing computational cost and reducing the inference speed of object detection.
The sub-graph suggestions reduce the number of potential relationships analyzed by the scene-based image editing system 106. In particular, as previously described, a sub-graph suggestion represents a relationship involving two or more particular object suggestions. Thus, in some cases, the scene-based image editing system 106 determines sub-graph suggestions for the input image 1500 to reduce the number of potential relationships by clustering, rather than maintaining the N(N-1) possible relationships. In one or more embodiments, the scene-based image editing system 106 uses the clustering and sub-graph suggestion generation process described in Y. Li, W. Ouyang, B. Zhou, Y. Cui, J. Shi, and X. Wang, "Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation," ECCV, 2018, which is incorporated herein by reference in its entirety.
As an example, for an object suggestion pair, scene-based image editing system 106 determines a subgraph based on a confidence score associated with the object suggestion. To illustrate, the scene-based image editing system 106 generates each object suggestion with a confidence score that indicates a confidence that the object suggestion is a correct match with a corresponding region of the input image. The scene-based image editing system 106 also determines sub-graph suggestions for the object suggestion pairs based on a combined confidence score that is the product of the confidence scores of the two object suggestions. The scene-based image editing system 106 also constructs the sub-graph suggestions as joint boxes of object suggestions with combined confidence scores.
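A small sketch of the combined confidence score and joint (union) box computation described above, under the assumption of simple (x, y, w, h) boxes, is:

```python
def union_box(box_a, box_b):
    """Joint (union) box covering two object suggestions, in (x, y, w, h) form."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    x1, y1 = min(ax, bx), min(ay, by)
    x2, y2 = max(ax + aw, bx + bw), max(ay + ah, by + bh)
    return (x1, y1, x2 - x1, y2 - y1)

def subgraph_suggestion(obj_a, obj_b):
    """Combined confidence is the product of the two object confidences,
    and the sub-graph region is the union box of the two suggestions."""
    score = obj_a["confidence"] * obj_b["confidence"]
    return {"box": union_box(obj_a["bbox"], obj_b["bbox"]), "confidence": score}

a = {"bbox": (10, 10, 50, 80), "confidence": 0.9}
b = {"bbox": (40, 30, 60, 60), "confidence": 0.8}
print(subgraph_suggestion(a, b))  # confidence 0.72, box (10, 10, 90, 80)
```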
In some cases, the scene-based image editing system 106 also suppresses the sub-graph suggestions so that each candidate relationship is represented as two objects and one sub-graph. In particular, in some embodiments, the scene-based image editing system 106 utilizes non-maximum suppression to represent the candidate relationships as $\langle o_i, o_j, s_k^i \rangle$, where $i \neq j$ and $s_k^i$ is the $k$-th sub-graph of all the sub-graphs associated with $o_i$; the sub-graphs of $o_i$ include $o_j$ and potentially other object suggestions. After suppressing the sub-graph suggestions, the scene-based image editing system 106 represents each object as a feature vector $o_i \in \mathbb{R}^D$ and each sub-graph as a feature map $s_k \in \mathbb{R}^{D \times K_a \times K_a}$, where $D$ and $K_a$ are dimensions.
After determining the object suggestions and sub-graph suggestions for objects in the input image, the scene-based image editing system 106 retrieves and embeds the relationships from the external knowledge base 1522. In one or more embodiments, the external knowledge base includes a dataset that relates to semantic relationships of objects. In particular, in some embodiments, the external knowledge base includes a semantic network that includes descriptions of relationships (also referred to herein as "common sense relationships") between objects based on background knowledge and contextual knowledge. In some implementations, the external knowledge base includes a database on one or more servers that includes knowledge of relationships from one or more sources, including expert-created resources, crowd-sourced resources, web-based sources, dictionaries, or other sources that include information about object relationships.
Furthermore, in one or more embodiments, embedding includes a representation of the relationship with the object as a vector. For example, in some cases, the relationship embedding includes using a vector representation of triples of relationships (i.e., object tags, one or more relationships, and object entities) extracted from an external knowledge base.
Indeed, in one or more embodiments, the scene-based image editing system 106 communicates with the external knowledge base 1522 to obtain useful object relationship information for improving object and sub-graph suggestions. Furthermore, in one or more embodiments, scene-based image editing system 106 refines the object suggestions and sub-graph suggestions (represented by block 1524) using the embedded relationships, as described in more detail below.
In some embodiments, in preparation for retrieving the relationships from the external knowledge base 1522, the scene-based image editing system 106 performs an internal refinement process on the object and sub-graph suggestions (e.g., in preparation for refining the features of the object and sub-graph suggestions). In particular, the scene-based image editing system 106 uses the knowledge that each object $o_i$ is connected to a set of sub-graphs $\mathbb{S}_i$ and that each sub-graph $s_k$ is associated with a set of objects $\mathbb{O}_k$ to refine the object vector (respectively, the sub-graph) by attending to the associated sub-graph feature maps (respectively, the associated object vectors). For example, in some cases, the internal refinement process is expressed as:

$$\bar{o}_i = o_i + f_{s \to o}\Big(\sum_{s_k \in \mathbb{S}_i} \alpha_k^{s \to o} \cdot s_k\Big), \qquad \bar{s}_k = s_k + f_{o \to s}\Big(\sum_{o_i \in \mathbb{O}_k} \alpha_i^{o \to s} \cdot o_i\Big)$$

where $\alpha_k^{s \to o}$ (respectively, $\alpha_i^{o \to s}$) is the output of a softmax layer indicating the weight for passing $s_k$ to $o_i$ (respectively, $o_i$ to $s_k$), and $f_{s \to o}$ and $f_{o \to s}$ are non-linear mapping functions. In one or more embodiments, due to the different dimensions of $o_i$ and $s_k$, the scene-based image editing system 106 applies pooling or spatial-location-based attention for the $s \to o$ or $o \to s$ refinement.
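Under the reconstructed formulation above, a toy NumPy sketch of a single s→o refinement step might look as follows; the compatibility scores and mapping function are simplified stand-ins for the learned softmax weights and non-linear layers:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def refine_object(o_i, pooled_subgraphs, scores, f_s_to_o):
    """One s->o refinement step: an attention-weighted sum of the (pooled)
    sub-graph features is mapped and added back to the object vector."""
    alphas = softmax(np.asarray(scores))                  # weights for each sub-graph
    message = sum(a * s for a, s in zip(alphas, pooled_subgraphs))
    return o_i + f_s_to_o(message)

rng = np.random.default_rng(0)
D = 8
o_i = rng.normal(size=D)
pooled_subgraphs = [rng.normal(size=D) for _ in range(3)]  # sub-graph maps after pooling
scores = [0.2, 1.5, -0.3]                                  # stand-in compatibility scores
f_s_to_o = lambda m: np.tanh(m)                            # stand-in non-linear mapping
print(refine_object(o_i, pooled_subgraphs, scores, f_s_to_o).shape)  # (8,)
```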
In some embodiments, once the internal refinement is completed, the scene-based image editing system 106 predicts an object label from each initially refined object feature vector and matches the label with the corresponding semantic entity in the external knowledge base 1522. In particular, the scene-based image editing system 106 accesses the external knowledge base 1522 to obtain the most common relationships corresponding to the object label. The scene-based image editing system 106 also selects a predetermined number of the most common relationships from the external knowledge base 1522 and uses the retrieved relationships to refine the features of the corresponding object suggestion/feature vector.
In one or more embodiments, after refining the object suggestions and sub-graph suggestions using the embedded relationships, the scene-based image editing system 106 predicts the object labels 1502 and the predicate labels from the refined suggestions. In particular, the scene-based image editing system 106 predicts the labels based on the refined object/sub-graph features. For example, in some cases, the scene-based image editing system 106 directly predicts each object label from the refined features of the corresponding feature vector. In addition, the scene-based image editing system 106 predicts the predicate labels (e.g., relationship labels) based on the subject and object feature vectors in combination with their corresponding sub-graph feature maps (since a sub-graph feature map is associated with several object suggestion pairs). In one or more embodiments, the inference process for predicting the labels applies a mapping layer $f_{rel}(\cdot)$ for predicate recognition and a mapping layer $f_{node}(\cdot)$ for object recognition, with $\otimes$ representing a convolution operation, where the refined feature vectors $\bar{o}_i$ and $\bar{s}_k$ are based on the relationships extracted from the external knowledge base.
In one or more embodiments, the scene-based image editing system 106 also uses the predicted labels to generate the semantic scene graph 1504. In particular, the scene-based image editing system 106 uses the object labels 1502 and the predicate labels derived from the refined features to create a graphical representation of the semantic information of the input image 1500. In one or more embodiments, the scene-based image editing system 106 generates the scene graph $\mathcal{G}$ from these predicted labels, where $\mathcal{G}$ denotes the semantic scene graph.
Thus, the scene-based image editing system 106 utilizes the relative positions of objects and their labels, in conjunction with the external knowledge base 1522, to determine the relationships between objects. The scene-based image editing system 106 utilizes the determined relationships in generating the behavior policy map 1410. As an example, the scene-based image editing system 106 determines that a hand and a cell phone have overlapping positions within the digital image. Based on the relative position and depth information, the scene-based image editing system 106 determines that the person (associated with the hand) has a "holding" relationship with the cell phone. As another example, the scene-based image editing system 106 determines that a person and a shirt have overlapping positions and overlapping depths within the digital image. Based on the relative position and relative depth information, the scene-based image editing system 106 determines that the person has a "wearing" relationship with the shirt. On the other hand, when the scene-based image editing system 106 determines that the person and the shirt have overlapping positions but that the shirt has an average depth greater than the average depth of the person within the digital image, the scene-based image editing system 106 determines, based on the relative position and relative depth information, that the person has an "in front of" relationship with the shirt.
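A toy, rule-based sketch of this kind of position/depth reasoning (the threshold and the generic "contact" label are assumptions for illustration, not the patent's actual logic) is:

```python
def boxes_overlap(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def infer_relationship(obj_a, obj_b, depth_tolerance=0.5):
    """Overlapping boxes with similar depth -> generic 'contact' (refined to
    'holding', 'wearing', etc. using the object labels); overlapping boxes
    with obj_b clearly deeper -> obj_a is 'in front of' obj_b."""
    if not boxes_overlap(obj_a["bbox"], obj_b["bbox"]):
        return None
    depth_gap = obj_b["average_depth"] - obj_a["average_depth"]
    if abs(depth_gap) <= depth_tolerance:
        return "contact"
    return "in front of" if depth_gap > depth_tolerance else "behind"

person = {"bbox": (100, 50, 80, 200), "average_depth": 4.0}
shirt = {"bbox": (110, 90, 60, 80), "average_depth": 4.6}
print(infer_relationship(person, shirt))  # 'in front of' with the default tolerance
```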
By generating a semantic scene graph for a digital image, the scene-based image editing system 106 provides improved flexibility and efficiency. Indeed, as described above, the scene-based image editing system 106 generates the semantic scene graph so that the features used in modifying the digital image are readily available when user interactions to perform a modification are received, providing improved flexibility. Thus, the scene-based image editing system 106 reduces the user interactions typically required under conventional systems to determine those features (or to generate needed content, such as bounding boxes or object masks) in preparation for a modification. Accordingly, the scene-based image editing system 106 provides a more efficient graphical user interface that requires fewer user interactions to modify the digital image.
In addition, by generating a semantic scene graph for the digital image, the scene-based image editing system 106 provides the ability to edit two-dimensional images like real world scenes. For example, based on the generated semantic scene graph of images generated using various neural networks, the scene-based image editing system 106 determines objects, their properties (location, depth, material, color, weight, size, labels, etc.). The scene-based image editing system 106 utilizes the information of the semantic scene graph to intelligently edit the image as if the image were a real-world scene.
Indeed, in one or more embodiments, the scene-based image editing system 106 facilitates modification of digital images using semantic scene graphs generated for the digital images. For example, in one or more embodiments, scene-based image editing system 106 facilitates modifying one or more object properties of objects depicted in a digital image using a corresponding semantic scene graph. FIGS. 16-21C illustrate modifying one or more object properties of an object depicted in a digital image in accordance with one or more embodiments.
Many conventional systems are inflexible in that they often require difficult and tedious workflows to make targeted modifications to specific object properties of objects depicted in digital images. Indeed, under such systems, modifying object properties typically requires manual manipulation of the object properties. For example, modifying the shape of an object depicted in a digital image typically requires multiple user interactions to manually reconstruct the boundary of the object (typically at the pixel level), while modifying the size typically requires cumbersome interactions with a resizing tool to resize and ensure scale. Thus, in addition to being inflexible, many conventional systems suffer from inefficiency, as the process required by these systems to perform such targeted modifications typically involves a significant amount of user interaction.
The scene-based image editing system 106 provides advantages over conventional systems by operating with improved flexibility and efficiency. In effect, by presenting graphical user interface elements through which user interactions can be targeted at object properties of objects, the scene-based image editing system 106 provides greater flexibility in terms of interactivity of objects depicted in the digital image. In particular, via graphical user interface elements, scene-based image editing system 106 provides flexible selection and modification of object properties. Thus, the scene-based image editing system 106 also provides improved efficiency by reducing the user interaction required to modify object properties. Indeed, as will be discussed below, the scene-based image editing system 106 enables user interactions to interact with a description of object properties in order to modify the object properties, thereby avoiding the difficult, cumbersome workflow of user interactions required by many conventional systems.
As suggested, in one or more embodiments, scene-based image editing system 106 facilitates modifying object properties of objects depicted in a digital image by determining object properties of those objects. Specifically, in some cases, scene-based image editing system 106 utilizes a machine learning model, such as an attribute classification neural network, to determine object attributes. Fig. 16-17 illustrate an attribute classification neural network used by scene-based image editing system 106 to determine object attributes of an object in accordance with one or more embodiments. In particular, fig. 16-17 illustrate multi-attribute contrast classification neural networks used by the scene-based image editing system 106 in one or more embodiments.
In one or more embodiments, the attribute classification neural network includes a computer-implemented neural network that identifies object attributes of objects depicted in the digital image. In particular, in some embodiments, the attribute classification neural network includes a computer-implemented neural network that analyzes objects depicted in the digital image, identifies object attributes of the objects, and in response provides labels corresponding to the object attributes. It should be appreciated that in many cases, attribute classification neural networks more broadly identify and classify attributes of semantic regions depicted in digital images. Indeed, in some implementations, the attribute classification neural network determines attributes of semantic regions depicted in the digital image other than the object (e.g., foreground or background).
FIG. 16 illustrates an overview of a multi-attribute contrast classification neural network in accordance with one or more embodiments. In particular, fig. 16 illustrates a scene-based image editing system 106 that utilizes a multi-attribute contrast classification neural network to extract various attribute tags (e.g., negative, positive, and unknown tags) of objects depicted within a digital image.
As shown in fig. 16, the scene-based image editing system 106 utilizes an embedded neural network 1604 with a digital image 1602 to generate an image-object feature map 1606 and a low-level attribute feature map 1610. In particular, the scene-based image editing system 106 generates an image-object feature map 1606 (e.g., image-object feature map X) by combining the object-tag embedding vector 1608 with the high-level attribute feature map from the embedded neural network 1604. For example, object-tag embedding vector 1608 represents the embedding of an object tag (e.g., "chair").
Further, as shown in FIG. 16, the scene-based image editing system 106 generates a local image-object feature vector $Z_{rel}$. In particular, the scene-based image editing system 106 utilizes the image-object feature map 1606 with the locator neural network 1612 to generate the local image-object feature vector $Z_{rel}$. Specifically, the scene-based image editing system 106 combines the image-object feature map 1606 with the local object attention feature vector 1616 (denoted G) to generate the local image-object feature vector $Z_{rel}$, which reflects the segmentation prediction of the relevant object (e.g., the "chair") depicted in the digital image 1602. As further shown in FIG. 16, in some embodiments, the locator neural network 1612 is trained using ground truth object segmentation masks 1618.
In addition, as shown in FIG. 16, scene-based image editing system 106 also generates local low-level attribute feature vector Zlow. In particular, referring to FIG. 16, the scene-based image editing system 106 utilizes the local object attention feature vector G from the locator neural network 1612 with the low-level attribute feature map 1610 to generate a local low-level attribute feature vector Zlow.
Further, as shown in fig. 16, the scene-based image editing system 106 generates a multi-attention feature vector Zatt. As shown in fig. 16, scene-based image editing system 106 generates multi-attention feature vector Zatt from image-object feature map 1606 by using attention map 1620 of multi-attention neural network 1614. Indeed, in one or more embodiments, the scene-based image editing system 106 utilizes the multi-attention feature vector Zatt to focus on features at different spatial locations related to objects depicted within the digital image 1602, while predicting attribute tags of the depicted objects.
As further shown in fig. 16, the scene-based image editing system 106 utilizes the classifier neural network 1624 to predict the attribute tags 1626 when generating a local image-object feature vector Zrel, a local low-level attribute feature vector Zlow, and a multi-attention feature vector Zatt (collectively referred to as vectors 1622 in fig. 16). Specifically, in one or more embodiments, the scene-based image editing system 106 utilizes a classifier neural network 1624 having a cascade of local image-object feature vectors Zrel, local low-level attribute feature vectors Zlow, and multi-attention feature vectors Zatt to determine attribute tags 1626 for objects (e.g., chairs) depicted within the digital image 1602. As shown in fig. 16, the scene-based image editing system 106 determines positive attribute tags for chairs depicted in the digital image 1602, negative attribute tags that are not attributes of chairs depicted in the digital image 1602, and unknown attribute tags corresponding to attribute tags that the scene-based image editing system 106 cannot be confidently classified as belonging to chairs depicted in the digital image 1602 using the classifier neural network 1624.
In some cases, the scene-based image editing system 106 utilizes probabilities (e.g., probability scores, floating point probabilities) output by the classifier neural network 1624 for particular attributes to determine whether the attributes are classified as positive, negative, and/or unknown attribute tags for objects (e.g., chairs) depicted in the digital image 1602. For example, when the probability output of a particular attribute meets a positive attribute threshold (e.g., positive probability, probability greater than 0.5), the scene-based image editing system 106 identifies the attribute as a positive attribute. Further, when the probability output of a particular attribute meets a negative attribute threshold (e.g., negative probability, probability below -0.5), the scene-based image editing system 106 identifies the attribute as a negative attribute. Furthermore, in some cases, when the probability output of a particular attribute does not meet a positive attribute threshold or a negative attribute threshold, the scene-based image editing system 106 identifies the attribute as an unknown attribute.
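A minimal sketch of this thresholding, assuming the positive and negative thresholds quoted above, is:

```python
def classify_attribute(probability, positive_threshold=0.5, negative_threshold=-0.5):
    """Bucket a classifier output into a positive, negative, or unknown
    attribute label using the thresholds described above."""
    if probability > positive_threshold:
        return "positive"
    if probability < negative_threshold:
        return "negative"
    return "unknown"

scores = {"red": 0.9, "blue": -0.8, "patterned": 0.1}
print({attr: classify_attribute(p) for attr, p in scores.items()})
# {'red': 'positive', 'blue': 'negative', 'patterned': 'unknown'}
```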
In some cases, a feature map includes height, width, and depth dimensions (H×W×D), with a D-dimensional feature vector at each of the H×W image positions. Further, in some embodiments, a feature vector includes a set of values representing characteristics and/or features of content (or objects) within the digital image. Indeed, in some embodiments, a feature vector includes a set of values corresponding to latent and/or salient attributes associated with the digital image. For example, in some cases, a feature vector is a multi-dimensional dataset representing the features depicted within the digital image. In one or more embodiments, a feature vector includes a set of numerical metrics learned by a machine learning algorithm.
FIG. 17 illustrates an architecture of a multi-attribute contrast classification neural network in accordance with one or more embodiments. Indeed, in one or more embodiments, as shown in fig. 17, the scene-based image editing system 106 utilizes a multi-attribute contrast classification neural network with embedded neural network, locator neural network, multi-attention neural network, and classifier neural network components to determine positive and negative attribute tags (e.g., in terms of output attribute presence probabilities) for objects depicted in the digital image.
As shown in FIG. 17, the scene-based image editing system 106 utilizes an embedded neural network within the multi-attribute contrast classification neural network. In particular, as shown in FIG. 17, the scene-based image editing system 106 utilizes a low-level embedding layer 1704 (e.g., embedding NN_l of the embedded neural network 1604 of FIG. 16) to generate a low-level attribute feature map 1710 from the digital image 1702. Further, as shown in FIG. 17, the scene-based image editing system 106 utilizes a high-level embedding layer 1706 (e.g., embedding NN_h of the embedded neural network 1604 of FIG. 16) to generate a high-level attribute feature map 1708 from the digital image 1702.
Specifically, in one or more embodiments, the scene-based image editing system 106 utilizes a convolutional neural network as the embedded neural network. For example, the scene-based image editing system 106 generates a D-dimensional image feature map $f_{img}(I) \in \mathbb{R}^{H \times W \times D}$ with spatial size $H \times W$ extracted from the convolutional-neural-network-based embedded neural network. In some cases, the scene-based image editing system 106 utilizes the output of the penultimate layer of ResNet-50 as the image feature map $f_{img}(I)$.
As shown in fig. 17, the scene-based image editing system 106 utilizes the high-level embedding layer and the low-level embedding layer embedded in the neural network to extract both the high-level attribute feature map 1708 and the low-level attribute feature map 1710. By extracting both the high-level attribute feature map 1708 and the low-level attribute feature map 1710 of the digital image 1702, the scene-based image editing system 106 addresses the heterogeneity of features between different categories of attributes. In practice, attributes span a broad semantic level.
By utilizing both low-level and high-level feature maps, the scene-based image editing system 106 accurately predicts attributes over a broad semantic level range. For example, the scene-based image editing system 106 utilizes low-level feature maps to accurately predict properties such as, but not limited to, colors (e.g., red, blue, multicolor) of depicted objects, patterns (e.g., stripes, dashed lines, stripes), geometries (e.g., shapes, sizes, gestures), textures (e.g., rough, smooth, jagged), or materials (e.g., woody, metallic, shiny, matte). Meanwhile, in one or more embodiments, the scene-based image editing system 106 utilizes high-level feature maps to accurately predict properties such as, but not limited to, object states (e.g., broken, dry, messy, full, stale) or actions (e.g., running, sitting, flying) of the depicted objects.
Further, as shown in FIG. 17, the scene-based image editing system 106 generates an image-object feature map 1714. In particular, as shown in FIG. 17, the scene-based image editing system 106 combines an object-tag embedding vector 1712 (e.g., the object-tag embedding vector 1608 of FIG. 16), derived from a label corresponding to the object (e.g., "chair"), with the high-level attribute feature map 1708 to generate the image-object feature map 1714 (e.g., the image-object feature map 1606 of FIG. 16). As further shown in FIG. 17, the scene-based image editing system 106 utilizes a feature synthesis module (e.g., $f_{comp}$) that outputs the image-object feature map 1714 from the object-tag embedding vector 1712 and the high-level attribute feature map 1708.
In one or more embodiments, the scene-based image editing system 106 generates the image-object feature map 1714 to provide an additional signal to the multi-attribute contrast classification neural network indicating the relevant object for which attributes are predicted (e.g., while also encoding features for that object). In particular, in some embodiments, the scene-based image editing system 106 incorporates the object-tag embedding vector 1712 (as input to the feature synthesis module $f_{comp}$ that generates the image-object feature map 1714) to improve the classification results of the multi-attribute contrast classification neural network by teaching the network to avoid infeasible object-attribute combinations (e.g., parked dogs, talking desks, barking sofas). Indeed, in some embodiments, the scene-based image editing system 106 also utilizes the object-tag embedding vector 1712 (as input to the feature synthesis module $f_{comp}$) to enable the multi-attribute contrast classification neural network to learn to associate certain object-attribute pairs together (e.g., a sphere is always round). In many cases, guiding the multi-attribute contrast classification neural network about which object it is predicting attributes for enables the network to focus on particular visual aspects of that object. This, in turn, improves the quality of the attributes extracted for the depicted object.
In one or more embodiments, the scene-based image editing system 106 utilizes a feature synthesis module (e.g., $f_{comp}$) to generate the image-object feature map 1714. In particular, the scene-based image editing system 106 implements the feature synthesis module (e.g., $f_{comp}$) as:

$$f_{comp}(f_{img}(I), \phi_o) = f_{img}(I) \odot f_{gate}(\phi_o)$$

and

$$f_{gate}(\phi_o) = \sigma\big(W_{g2} \cdot \mathrm{ReLU}(W_{g1}\phi_o + b_{g1}) + b_{g2}\big)$$
In the first function above, the scene-based image editing system 106 utilizes a channel-wise product (⊙) of the high-level attribute feature map $f_{img}(I)$ and the filter $f_{gate}(\phi_o)$ of the object-tag embedding vector $\phi_o$ to generate the image-object feature map $X$.

Furthermore, in the second function above, the scene-based image editing system 106 uses a sigmoid function σ, broadcast to match the spatial dimensions of the feature map, implemented as a 2-layer multi-layer perceptron (MLP). Indeed, in one or more embodiments, the scene-based image editing system 106 uses $f_{gate}$ as a filter that selects the attribute features relevant to the object of interest (e.g., as indicated by the object-tag embedding vector $\phi_o$). In many cases, the scene-based image editing system 106 also utilizes $f_{gate}$ to suppress incompatible object-attribute pairs (e.g., speaking tables). In some embodiments, the scene-based image editing system 106 identifies an object-image tag for each object depicted within the digital image and, by utilizing the multi-attribute contrast classification neural network with the identified object-image tags, outputs attributes for each depicted object.
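A small NumPy sketch of the gating computation, under the reconstructed equations above and with arbitrary illustrative dimensions, might be:

```python
import numpy as np

def relu(x): return np.maximum(x, 0.0)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

def f_gate(phi_o, W1, b1, W2, b2):
    """2-layer MLP with a sigmoid, producing one gating value per channel."""
    return sigmoid(W2 @ relu(W1 @ phi_o + b1) + b2)

def f_comp(image_feature_map, phi_o, W1, b1, W2, b2):
    """Channel-wise product of the H x W x D feature map with the gate,
    broadcast across spatial positions."""
    gate = f_gate(phi_o, W1, b1, W2, b2)   # shape (D,)
    return image_feature_map * gate        # broadcasting over H and W

rng = np.random.default_rng(0)
H, W, D, d = 7, 7, 16, 10                  # illustrative sizes
fmap = rng.normal(size=(H, W, D))
phi_o = rng.normal(size=d)                 # object-tag embedding vector
W1, b1 = rng.normal(size=(32, d)), np.zeros(32)
W2, b2 = rng.normal(size=(D, 32)), np.zeros(D)
print(f_comp(fmap, phi_o, W1, b1, W2, b2).shape)  # (7, 7, 16)
```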
Furthermore, as shown in FIG. 17, the scene-based image editing system 106 utilizes the image-object feature map 1714 with the locator neural network 1716 to generate a local image-object feature vector $Z_{rel}$ (e.g., the locator neural network 1612 and $Z_{rel}$ also shown in FIG. 16). Specifically, as shown in FIG. 17, the scene-based image editing system 106 uses a convolutional layer $f_{rel}$ of the locator neural network 1716 to generate a local object attention feature vector 1717 (e.g., G in FIG. 16) reflecting the segmentation prediction of the depicted object. Then, as shown in FIG. 17, the scene-based image editing system 106 combines the local object attention feature vector 1717 with the image-object feature map 1714 to generate the local image-object feature vector $Z_{rel}$. As shown in FIG. 17, the scene-based image editing system 106 utilizes a matrix multiplication 1720 between the local object attention feature vector 1717 and the image-object feature map 1714 to generate the local image-object feature vector $Z_{rel}$.

In some cases, the digital image may include multiple objects (and/or a background). Thus, in one or more embodiments, the scene-based image editing system 106 utilizes the locator neural network to learn an improved feature aggregation that suppresses irrelevant object regions (e.g., regions not reflected in the segmentation prediction of the target object, in order to isolate the target object). For example, with reference to the digital image 1702, the scene-based image editing system 106 utilizes the locator neural network 1716 to localize the object region such that the multi-attribute contrast classification neural network predicts attributes of the correct object (e.g., the depicted chair) and not of other unrelated objects (e.g., the depicted horse). To this end, in some embodiments, the scene-based image editing system 106 utilizes a locator neural network trained with supervised learning on object segmentation masks (e.g., ground truth object masks) from a dataset of labeled images (e.g., ground truth images as described below).
To illustrate, in some cases, the scene-based image editing system 106 utilizes a 2-stacked convolutional layer $f_{rel}$ (e.g., with kernel size 1), followed by a spatial softmax, to generate the local object attention feature vector G (e.g., the local object region) from the image-object feature map $X \in \mathbb{R}^{H \times W \times D}$ according to the following function (3):

$$G = \mathrm{softmax}\big(f_{rel}(X)\big) \quad (3)$$

For example, the local object attention feature vector G includes a single H×W data plane (e.g., a feature map having a single dimension). In some cases, the local object attention feature vector G includes a feature map (e.g., a local object attention feature map) that includes one or more feature vector dimensions.

Then, in one or more embodiments, the scene-based image editing system 106 utilizes the local object attention feature vector $G_{h,w}$ and the image-object feature map $X_{h,w}$ to generate the local image-object feature vector $Z_{rel}$ according to:

$$Z_{rel} = \sum_{h,w} G_{h,w} \cdot X_{h,w}$$

In some cases, in the above function, the scene-based image editing system 106 uses the local object attention feature vector $G_{h,w}$ to pool the H×W D-dimensional feature vectors $X_{h,w}$ (from the image-object feature map) into a single D-dimensional feature vector $Z_{rel}$.
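A compact NumPy sketch of the spatial-softmax pooling that produces a single D-dimensional vector from an H×W×D feature map (a stand-in for G and Z_rel above) is:

```python
import numpy as np

def spatial_softmax(scores):
    """Softmax over all H x W positions of a single-channel score map."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attention_pool(feature_map, score_map):
    """Pool an H x W x D feature map into a single D-dimensional vector
    using the spatial attention weights."""
    G = spatial_softmax(score_map)                  # (H, W)
    return np.einsum("hw,hwd->d", G, feature_map)   # sum_{h,w} G[h,w] * X[h,w]

rng = np.random.default_rng(1)
X = rng.normal(size=(7, 7, 16))      # image-object feature map
scores = rng.normal(size=(7, 7))     # stand-in output of the f_rel convolution
print(attention_pool(X, scores).shape)  # (16,)
```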
In one or more embodiments, referring to FIG. 17, the scene-based image editing system 106 trains the locator neural network 1716 to learn the local object attention feature vector 1717 (e.g., G) with direct supervision from the object segmentation mask 1718 (e.g., the ground truth object segmentation mask 1618 of FIG. 16).
Furthermore, as shown in FIG. 17, the scene-based image editing system 106 utilizes the image-object feature map 1714 with the multi-attention neural network 1722 to generate the multi-attention feature vector Z att (e.g., multi-attention neural network 1614 and Z of FIG. 16) att ). In particular, as shown in FIG. 17, the scene-based image editing system 106 utilizes a convolution layer f with an image-object feature map 1714 att (e.g., attention layer) to extract attention profile 1724 (e.g., attention 1 through attention k) (e.g., attention profile 1620 of fig. 16). Then, as further shown in FIG. 17, scene-based image editing system 106 passes extracted attention attempts 424 (attention 1 through attention k) through projection layer f proj To extract (e.g. via linear projection) the vector Z for generating multi-attention features att Is provided.
In one or more embodiments, scene-based image editing system 106 utilizes multi-attention feature vector Z att The properties of the depicted object within the digital image are accurately predicted by providing focus to different portions of the depicted object and/or areas surrounding the depicted object (e.g., focusing on features at different spatial locations). To illustrate, in some cases, scene-based image editing system 106 utilizes multi-attention feature vector Z att To extract attributes such as "barefoot" or "optical head" by focusing on different parts of a person (i.e., object) depicted in the digital image. Likewise, in some embodiments, scene-based image editing system 106 utilizes multi-attention feature vector Z att To distinguish between different activity attributes (e.g., jump and squat) that may depend on information from the surroundings of the depicted object.
In some cases, scene-based image editing system 106 generates an attention map for each attribute depicted for an object within the digital image. For example, the scene-based image editing system 106 utilizes an image-object feature map having one or more attention layers to generate an attention map from the image-object feature map for each known attribute. The scene-based image editing system 106 then uses the attention map with the projection layer to generate a multi-attention feature vector Z att . In one or more embodiments, scene-based image editing system 106 generates various numbers of attention attempts for various properties depicted for objects within a digital image (e.g., the system may generate attention attempts for each property or a different number of attention attempts than the number of properties).
Furthermore, in one or more embodiments, the scene-based image editing system 106 utilizes a hybrid shared multi-attention approach that allows attention hops while generating the attention maps from the image-object feature map. For example, the scene-based image editing system 106 may utilize a convolutional layer $f_{att}^{(m)}$ (e.g., an attention layer) to extract M attention maps $\{A^{(m)}\}_{m=1}^{M}$ from the image-object feature map X according to:

$$E^{(m)} = f_{att}^{(m)}(X)$$

and

$$A^{(m)} = \mathrm{softmax}\big(E^{(m)}\big)$$

In some cases, the scene-based image editing system 106 utilizes a convolutional layer $f_{att}^{(m)}$ having an architecture similar to the 2-stacked convolutional layer $f_{rel}$ of function (3) above. By utilizing the approach outlined in the second function above, the scene-based image editing system 106 utilizes different sets of attention maps corresponding to different ranges of attributes.

In one or more embodiments, the scene-based image editing system 106 then uses the M attention maps (e.g., $A^{(m)}$) to aggregate M attention feature vectors $z_{att}^{(m)}$ from the image-object feature map X:

$$z_{att}^{(m)} = \sum_{h,w} A_{h,w}^{(m)} \cdot X_{h,w}$$

Furthermore, referring to FIG. 17, the scene-based image editing system 106 passes the M attention feature vectors $z_{att}^{(m)}$ through a projection layer $f_{proj}^{(m)}$ to extract the attention feature vectors $z^{(m)}$ according to:

$$z^{(m)} = f_{proj}^{(m)}\big(z_{att}^{(m)}\big)$$

Then, in one or more embodiments, the scene-based image editing system 106 concatenates the individual attention feature vectors to generate the multi-attention feature vector $Z_{att}$:

$$Z_{att} = \mathrm{concat}\big(z^{(1)}, \ldots, z^{(M)}\big)$$
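Putting the per-hop pooling and projection together, a toy NumPy sketch of the multi-attention aggregation (with stand-in random score maps and projection matrices) might be:

```python
import numpy as np

def multi_attention_vector(X, score_maps, projections):
    """Aggregate one pooled vector per attention hop, project each one,
    and concatenate the results into Z_att."""
    hops = []
    for scores, P in zip(score_maps, projections):
        A = np.exp(scores - scores.max()); A /= A.sum()  # spatial softmax
        z = np.einsum("hw,hwd->d", A, X)                 # attention-weighted pooling
        hops.append(P @ z)                               # linear projection
    return np.concatenate(hops)

rng = np.random.default_rng(2)
H, W, D, M, P_dim = 7, 7, 16, 4, 8
X = rng.normal(size=(H, W, D))
score_maps = [rng.normal(size=(H, W)) for _ in range(M)]
projections = [rng.normal(size=(P_dim, D)) for _ in range(M)]
print(multi_attention_vector(X, score_maps, projections).shape)  # (32,)
```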
In some embodiments, the scene-based image editing system 106 utilizes a divergence loss with the multi-attention neural network in the M-attention-hop approach. In particular, the scene-based image editing system 106 utilizes a divergence loss that encourages the attention maps to focus on different (or unique) regions of the digital image (from the image-object feature map). In some cases, the scene-based image editing system 106 promotes diversity between the attention features by utilizing a divergence loss $\mathcal{L}_{div}$ that minimizes the cosine similarity (e.g., based on the $\ell_2$ norms) between the attention weight vectors (e.g., E) of the attention hops. For example, in some cases, the scene-based image editing system 106 determines the divergence loss from the pairwise cosine similarities between the attention weight vectors of the different attention hops.

In one or more embodiments, the scene-based image editing system 106 utilizes the divergence loss $\mathcal{L}_{div}$ to learn the parameters of the multi-attention neural network 1722 and/or of the multi-attribute contrast classification neural network (as a whole).
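A simplified stand-in for such a divergence loss, summing pairwise cosine similarities between flattened attention weight maps, could look like this (the exact loss form in the source is not reproduced here):

```python
import numpy as np

def divergence_loss(attention_weight_maps):
    """Sum of pairwise cosine similarities between the flattened attention
    weight maps of different hops; minimizing it pushes the hops toward
    attending to different regions. (A simplified stand-in, not the
    patent's exact divergence loss.)"""
    E = [e.ravel() / (np.linalg.norm(e.ravel()) + 1e-8) for e in attention_weight_maps]
    loss = 0.0
    for m in range(len(E)):
        for n in range(m + 1, len(E)):
            loss += float(E[m] @ E[n])
    return loss

rng = np.random.default_rng(3)
hops = [rng.normal(size=(7, 7)) for _ in range(4)]
print(round(divergence_loss(hops), 4))
```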
In addition, as shown in FIG. 17, the scene-based image editing system 106 also generates a local low-level attribute feature vector $Z_{low}$ (e.g., $Z_{low}$ of FIG. 16). In fact, as shown in FIG. 17, the scene-based image editing system 106 generates the local low-level attribute feature vector $Z_{low}$ by combining the low-level attribute feature map 1710 and the local object attention feature vector 1717. For example, as shown in FIG. 17, the scene-based image editing system 106 utilizes a matrix multiplication 1726 to combine the low-level attribute feature map 1710 and the local object attention feature vector 1717 to generate the local low-level attribute feature vector $Z_{low}$.

In one or more embodiments, by generating and utilizing the local low-level attribute feature vector $Z_{low}$, the scene-based image editing system 106 improves the accuracy of the low-level features (e.g., color, material) extracted for objects depicted in the digital image. In particular, in one or more embodiments, the scene-based image editing system 106 pools the low-level features (represented by the low-level attribute feature map from the low-level embedding layer) using the local object attention feature vector (e.g., from the locator neural network). Indeed, in one or more embodiments, the scene-based image editing system 106 constructs the local low-level attribute feature vector $Z_{low}$ by pooling the low-level attribute feature map using the local object attention feature vector.
As further shown in FIG. 17, the scene-based image editing system 106 utilizes a classifier neural network 1732 ($f_{classifier}$) (e.g., the classifier neural network 1624 of FIG. 16) with the local image-object feature vector $Z_{rel}$, the multi-attention feature vector $Z_{att}$, and the local low-level attribute feature vector $Z_{low}$ to determine positive attribute labels 1728 and negative attribute labels 1730 for the object (e.g., the "chair") depicted within the digital image 1702. In some embodiments, the scene-based image editing system 106 utilizes a concatenation of the local image-object feature vector $Z_{rel}$, the multi-attention feature vector $Z_{att}$, and the local low-level attribute feature vector $Z_{low}$ as input to the classification layers of the classifier neural network 1732 ($f_{classifier}$). Then, as shown in FIG. 17, the classifier neural network 1732 ($f_{classifier}$) generates positive attribute labels 1728 (e.g., red, bright red, clean, giant, wooden) for the depicted object in the digital image 1702 and also generates negative attribute labels 1730 (e.g., blue, stuffed, patterned, multicolored).
In one or more embodiments, the scene-based image editing system 106 utilizes a classifier neural network that is a 2-layer MLP. In some cases, the scene-based image editing system 106 utilizes a classifier neural network that includes various numbers of hidden units and output logits followed by a sigmoid. In some embodiments, the classifier neural network is trained by the scene-based image editing system 106 to generate positive attribute labels and negative attribute labels. Although one or more embodiments described herein utilize a 2-layer MLP, in some cases, the scene-based image editing system 106 utilizes linear layers (e.g., within the classifier neural network, for $f_{gate}$, and for the image-object feature map).
Furthermore, in one or more embodiments, the scene-based image editing system 106 utilizes the local image-object feature vector $Z_{rel}$, the multi-attention feature vector $Z_{att}$, and the local low-level attribute feature vector $Z_{low}$ in various combinations with the classifier neural network to extract attributes for the objects depicted in the digital image. For example, in some cases, the scene-based image editing system 106 provides the local image-object feature vector $Z_{rel}$ and the multi-attention feature vector $Z_{att}$ to the classifier neural network to extract attributes for the depicted object. In some cases, as shown in FIG. 17, the scene-based image editing system 106 utilizes a concatenation of each of the local image-object feature vector $Z_{rel}$, the multi-attention feature vector $Z_{att}$, and the local low-level attribute feature vector $Z_{low}$ with the classifier neural network.
In one or more embodiments, the scene-based image editing system 106 utilizes the classifier neural network 1732 to generate as output a predictive score corresponding to the attribute tags. For example, classifier neural network 1732 may generate a predictive score for one or more attribute tags (e.g., blue score of 0.04, red score of 0.9, orange score of 0.4). Then, in some cases, the scene-based image editing system 106 utilizes attribute tags corresponding to predictive scores that meet a threshold predictive score. Indeed, in one or more embodiments, the scene-based image editing system 106 selects various attribute tags (both positive and negative) by utilizing output predictive scores of attributes from the classifier neural network.
Although one or more embodiments herein describe a scene-based image editing system 106 that utilizes specific embedded neural networks, locator neural networks, multi-attention neural networks, and classifier neural networks, the scene-based image editing system 106 may utilize various types of neural networks for these components (e.g., CNNs, FCNs). Further, while one or more embodiments herein describe a scene-based image editing system 106 that combines various feature maps (and/or feature vectors) using matrix multiplication, in some embodiments, the scene-based image editing system 106 combines feature maps (and/or feature vectors) using various methods, such as, but not limited to, concatenation, multiplication, addition, and/or aggregation. For example, in some implementations, the scene-based image editing system 106 combines the local object attention feature vector and the image-object feature map by concatenating the local object attention feature vector and the image-object feature map to generate the local image-object feature vector.
Thus, in some cases, the scene-based image editing system 106 utilizes an attribute classification neural network (e.g., a multi-attribute contrast classification neural network) to determine object attributes of objects depicted in the digital image or attributes of semantic regions of the depiction that are otherwise determined. In some cases, scene-based image editing system 106 adds object properties or other properties determined for a digital image to a semantic scene graph of the digital image. In other words, the scene-based image editing system 106 utilizes the attribute-classification neural network to generate a semantic scene graph of the digital image. However, in some implementations, the scene-based image editing system 106 stores the determined object properties or other properties in a separate storage location.
Further, in one or more embodiments, scene-based image editing system 106 facilitates modifying object properties of objects depicted in a digital image by modifying one or more object properties in response to user input. Specifically, in some cases, scene-based image editing system 106 modifies object properties using a machine learning model such as a property modification neural network. FIG. 18 illustrates an attribute modification neural network used by the scene-based image editing system 106 to modify object attributes in accordance with one or more embodiments.
In one or more embodiments, the attribute modification neural network comprises a computer-implemented neural network that modifies specified object attributes (or other specified attributes that specify semantic regions) of an object. In particular, in some embodiments, the attribute modification neural network comprises a computer-implemented neural network that receives user input targeting and indicating a change in an object attribute and modifies the object attribute according to the indicated change. In some cases, the attribute modifying neural network comprises a generative network.
As shown in fig. 18, scene-based image editing system 106 provides object 1802 (e.g., a digital image depicting object 1802) and modification inputs 1804a-1804b to object modification neural network 1806. In particular, FIG. 18 shows modification inputs 1804a-1804b that include inputs for an object property to be changed (e.g., black of object 1802) and inputs for a change to occur (e.g., changing the color of object 1802 to white).
As shown in fig. 18, the object modifying neural network 1806 generates a visual feature map 1810 from the object 1802 using the image encoder 1808. In addition, the object modification neural network 1806 generates text features 1814a-1814b from the modification inputs 1804a-1804b using the text encoder 1812. In particular, as shown in FIG. 18, the object modifying neural network 1806 generates a visual feature map 1810 and text features 1814a-1814b (labeled "visual-semantic embedding space" or "VSE space") within the joint embedding space 1816.
In one or more embodiments, the object modification neural network 1806 performs text-guided visual feature manipulation to ground the modification inputs 1804a-1804b in the visual feature map 1810 and to manipulate the corresponding regions of the visual feature map 1810 with the provided text features. For example, as shown in FIG. 18, the object modification neural network 1806 generates a manipulated visual feature map 1820 from the visual feature map 1810 and the text features 1814a-1814b using operation 1818 (e.g., vector arithmetic operations).
As further shown in fig. 18, the object-modifying neural network 1806 also utilizes a fixed edge extractor 1822 to extract edges 1824 (boundaries) of the object 1802. In other words, the object modification neural network 1806 utilizes a fixed edge extractor 1822 to extract edges 1824 of the region to be modified.
Further, as shown, the object modification neural network 1806 utilizes a decoder 1826 to generate a modified object 1828. In particular, the decoder 1826 generates the modified object 1828 from the edge 1824 extracted from the object 1802 and the manipulated visual feature map 1820 generated from the object 1802 and the modification inputs 1804a-1804b.
In one or more embodiments, the scene-based image editing system 106 trains the object modification neural network 1806 to process open-vocabulary instructions and open-domain digital images. For example, in some cases, the scene-based image editing system 106 trains the object modification neural network 1806 on a large-scale image caption dataset to learn a generic visual-semantic embedding space. In some cases, the scene-based image editing system 106 utilizes a convolutional neural network and/or a long short-term memory network as the encoders of the object modification neural network 1806 to convert digital images and text inputs into visual and text features.
A more detailed description of text-guided visual feature manipulation is provided below. As previously described, in one or more embodiments, the scene-based image editing system 106 utilizes the joint embedding space 1816 to manipulate the visual feature map 1810 with the text instructions of the modification inputs 1804a-1804b via vector arithmetic operations. When manipulating certain objects or object attributes, the object modification neural network 1806 aims to modify only specific regions while keeping other regions unchanged. Thus, the object modification neural network 1806 performs vector arithmetic operations between the visual feature map 1810 and the text features 1814a-1814b (e.g., represented as text feature vectors).
For example, in some cases, the object modification neural network 1806 identifies the regions in the visual feature map 1810 to be manipulated (i.e., grounds the modification inputs 1804a-1804b on the spatial feature map). In some cases, the object modification neural network 1806 provides soft grounding for the text query via a weighted sum of the visual feature map 1810. In particular, in some cases, the object modification neural network 1806 uses the text features 1814a-1814b as weights to compute the weighted sum of the visual feature map 1810. Using this method, the object modification neural network 1806 obtains a soft grounding map that roughly locates the region in the visual feature map 1810 corresponding to the text instruction.
In one or more embodiments, the object modification neural network 1806 uses the grounding map as a location-adaptive coefficient to control the manipulation strength at different locations. In some cases, the object modification neural network 1806 utilizes a coefficient α to control the global manipulation strength, which enables a continuous transition between the source image and the manipulated image. In one or more embodiments, the scene-based image editing system 106 denotes the visual feature vector of the visual feature map 1810 at spatial position (i, j) (where i, j ∈ {0, 1, ..., 6}) as v_{i,j}.
The scene-based image editing system 106 performs various types of operations using the object modification neural network 1806 via vector arithmetic operations weighted by the soft grounding map and the coefficient α. For example, in some cases, the scene-based image editing system 106 utilizes the object modification neural network 1806 to change an object attribute or a global attribute. The object modification neural network 1806 embeds the text features of a source concept (e.g., "black triangle") and a target concept (e.g., "white triangle"), denoted t_1 and t_2, respectively. The object modification neural network 1806 manipulates the image feature vector v_{i,j} at position (i, j) as follows:

\tilde{v}_{i,j} = v_{i,j} + \alpha \langle v_{i,j}, t_1 \rangle (t_2 - t_1)
where i, j ∈ {0, 1, ..., 6} and \tilde{v}_{i,j} is the manipulated visual feature vector at position (i, j) of the 7×7 feature map.
In one or more embodiments, the object modification neural network 1806 removes the source feature t_1 from, and adds the target feature t_2 to, each visual feature vector v_{i,j}. In addition, ⟨v_{i,j}, t_1⟩, the value of the soft grounding map at position (i, j), is calculated as the dot product of the image feature vector and the source text feature. In other words, the value represents the projection of the visual embedding v_{i,j} onto the direction of the text embedding t_1. In some cases, the object modification neural network 1806 uses this value as a location-adaptive manipulation strength to control which regions in the image should be edited. In addition, the object modification neural network 1806 uses the coefficient α as a hyperparameter that controls the image-level manipulation strength. By smoothly increasing α, the object modification neural network 1806 achieves a smooth transition from the source attribute to the target attribute.
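The attribute-change manipulation above can be summarized in a short sketch. The following NumPy code is a minimal illustration of the vector arithmetic only, not the patent's implementation; the array shapes, variable names, and example values are assumptions.

```python
import numpy as np

def manipulate_attribute(visual_map, t_source, t_target, alpha=1.0):
    """Swap a source concept for a target concept in the joint embedding space.

    visual_map: (7, 7, D) array of visual feature vectors v_ij
    t_source, t_target: (D,) text embeddings t1 and t2
    alpha: global (image-level) manipulation strength
    """
    # Soft grounding: the dot product <v_ij, t1> locates regions tied to the source concept.
    grounding = np.einsum("ijd,d->ij", visual_map, t_source)
    # Location-adaptive edit: remove the source feature and add the target feature,
    # weighted by the grounding value at each spatial position.
    return visual_map + alpha * grounding[..., None] * (t_target - t_source)

# Example with random embeddings (D = 1024, matching the 1024x7x7 feature map).
rng = np.random.default_rng(0)
v = rng.normal(size=(7, 7, 1024))
t1, t2 = rng.normal(size=1024), rng.normal(size=1024)
v_manipulated = manipulate_attribute(v, t1, t2, alpha=0.5)
```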
In some implementations, the scene-based image editing system 106 utilizes the object modification neural network 1806 to remove a concept (e.g., an object attribute, an object, or another visual element) from the digital image (e.g., remove an accessory from a person). In some cases, the object modification neural network 1806 denotes the semantic embedding of the concept to be removed as t. Accordingly, the object modification neural network 1806 performs the following removal operation:

\tilde{v}_{i,j} = v_{i,j} - \alpha \langle v_{i,j}, t \rangle t
furthermore, in some embodiments, the scene-based image editing system 106 utilizes the object modification neural network 1806 to modify the extent to which object properties (or other properties of the semantic region) appear (e.g., make red apples less red or increase the brightness of the digital image). In some cases, the object modifies the intensity of the neural network 1806 control attribute via the hyper-parameter α. By smoothly adjusting α, the subject modifying neural network 1806 gradually emphasizes or attenuates the extent to which the attribute appears, as follows:
After deriving the manipulated feature map, the object modification neural network 1806 utilizes the decoder 1826 (an image decoder) to generate the manipulated image (e.g., the modified object 1828). In one or more embodiments, the scene-based image editing system 106 trains the object modification neural network 1806 as described by F. Faghri et al. in "VSE++: Improving Visual-Semantic Embeddings with Hard Negatives," arXiv:1707.05612, 2017, which is incorporated herein by reference in its entirety. In some cases, the decoder 1826 takes the 1024×7×7 feature map as input and consists of seven ResNet blocks with upsampling layers in between, generating a 256×256 image. Furthermore, in some cases, the scene-based image editing system 106 utilizes a discriminator, including a multi-scale patch-based discriminator. In some implementations, the scene-based image editing system 106 trains the decoder 1826 with a GAN loss, a perceptual loss, and a discriminator feature matching loss. Furthermore, in some embodiments, the fixed edge extractor 1822 includes a bi-directional cascade network.
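For illustration, the following PyTorch sketch mirrors the decoder described above: a 1024×7×7 feature map passed through seven residual blocks with upsampling in between to produce a 256×256 image. The channel schedule, activation choices, and exact upsampling placement are assumptions made for this sketch and are not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        h = self.conv2(F.relu(self.conv1(x)))
        return F.relu(h + self.skip(x))

class ImageDecoder(nn.Module):
    """Decodes a manipulated 1024x7x7 feature map into a 256x256 RGB image."""
    def __init__(self):
        super().__init__()
        chans = [1024, 512, 256, 128, 64, 32, 32, 32]  # seven ResNet blocks
        self.blocks = nn.ModuleList(
            [ResBlock(chans[i], chans[i + 1]) for i in range(7)]
        )
        self.to_rgb = nn.Conv2d(chans[-1], 3, 3, padding=1)

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i < 5:  # upsample between early blocks: 7 -> 224
                x = F.interpolate(x, scale_factor=2, mode="nearest")
        # Final resize to the 256x256 output resolution.
        x = F.interpolate(x, size=(256, 256), mode="bilinear", align_corners=False)
        return torch.tanh(self.to_rgb(x))

# Example: ImageDecoder()(torch.randn(1, 1024, 7, 7)).shape -> (1, 3, 256, 256)
```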
FIGS. 19A-19C illustrate a graphical user interface implemented by the scene-based image editing system 106 to facilitate modifying object properties of objects depicted in a digital image in accordance with one or more embodiments. In fact, while FIGS. 19A-19C specifically illustrate modifying object properties of an object, it should be noted that in various embodiments, the scene-based image editing system 106 similarly modifies properties of other semantic regions (e.g., background, foreground, ground, sky, etc.) of a digital image.
In effect, as shown in FIG. 19A, the scene-based image editing system 106 provides a graphical user interface 1902 for display on a client device 1904 and provides a digital image 1906 for display within the graphical user interface 1902. As further shown, digital image 1906 depicts object 1908.
As further shown in FIG. 19A, in response to detecting a user interaction with object 1908, scene-based image editing system 106 provides properties menu 1910 for display within graphical user interface 1902. In some embodiments, the properties menu 1910 provides one or more object properties of the object 1908. In effect, FIG. 19A shows that the properties menu 1910 provides object property indicators 1912a-1912c that indicate the shape, color, and material, respectively, of the object 1908. However, it should be noted that various alternative or additional object properties are provided in various embodiments.
In one or more embodiments, scene-based image editing system 106 retrieves object properties for object property indicators 1912a-1912c from the semantic scene graph generated for digital image 1906. Indeed, in some implementations, scene-based image editing system 106 generates a semantic scene graph for digital image 1906 (e.g., before detecting user interaction with object 1908). In some cases, scene-based image editing system 106 utilizes an attribute classification neural network to determine object attributes of object 1908 and includes the determined object attributes in the semantic scene graph. In some implementations, the scene-based image editing system 106 retrieves object properties from separate storage locations.
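To illustrate the lookup described above, the following sketch shows one way object attributes might be stored in, and retrieved from, a semantic scene graph in order to populate the attributes menu. The dictionary layout, attribute values, and helper name are hypothetical assumptions made for this sketch.

```python
# Hypothetical scene graph fragment for the object shown in FIG. 19A.
semantic_scene_graph = {
    "objects": {
        "object_1908": {
            "attributes": {"shape": "rectangular", "color": "brown", "material": "wood"},
            "relationships": [],
        }
    }
}

def get_attribute_indicators(graph, object_id):
    """Return (attribute, value) pairs used to populate the attributes menu."""
    return list(graph["objects"][object_id]["attributes"].items())

print(get_attribute_indicators(semantic_scene_graph, "object_1908"))
# [('shape', 'rectangular'), ('color', 'brown'), ('material', 'wood')]
```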
As shown in fig. 19B, scene-based image editing system 106 detects user interaction with object attribute indicator 1912c. In fact, in one or more embodiments, the object attribute indicators 1912a-1912c are interactive. As shown, in response to detecting the user interaction, scene-based image editing system 106 removes the corresponding object attribute of object 1908 from the display. Further, in response to detecting the user interaction, the scene-based image editing system 106 provides a numeric keypad 1914 for display within the graphical user interface 1902. Thus, the scene-based image editing system 106 provides a prompt to enter text user input. In some cases, upon detecting the user interaction with object attribute indicator 1912c, the scene-based image editing system 106 instead maintains the corresponding object attribute for display, removing it only after confirming that the object attribute is the target of the modification.
As shown in fig. 19C, scene-based image editing system 106 detects one or more user interactions with numeric keyboard 1914 displayed within graphical user interface 1902. In particular, scene based image editing system 106 receives text user input provided via numeric keyboard 1914. Scene-based image editing system 106 also determines that text user input provides a change to the object properties corresponding to object properties indicator 1912 c. Further, as shown, scene-based image editing system 106 provides text user input for display as part of object attribute indicator 1912 c.
In this case, user interaction with the graphical user interface 1902 provides instructions to change the material (material) of the object 1908 from a first material (e.g., wood) to a second material (e.g., metal). Thus, upon receiving text user input regarding the second material, the scene-based image editing system 106 modifies the digital image 1906 by modifying the object properties of the object 1908 to reflect the second material provided by the user.
In one or more embodiments, scene-based image editing system 106 utilizes a property modification neural network to change object properties of object 1908. In particular, as described above with reference to fig. 18, the scene-based image editing system 106 provides the digital image 1906 and the modification input of the first material and the second material composition provided by the text user input to the attribute modification neural network. Thus, the scene-based image editing system 106 utilizes the attribute modification neural network to provide as output a modified digital image depicting the object 1908 with modified object attributes.
FIGS. 20A-20C illustrate another graphical user interface implemented by the scene-based image editing system 106 to facilitate modifying object properties of objects depicted in a digital image in accordance with one or more embodiments. As shown in fig. 20A, the scene-based image editing system 106 provides a digital image 2006 depicting an object 2008 for display within a graphical user interface 2002 of a client device 2004. Further, upon detecting a user interaction with object 2008, scene-based image editing system 106 provides a properties menu 2010 having object property indicators 2012a-2012c listing the object properties of object 2008.
As shown in fig. 20B, scene-based image editing system 106 detects additional user interactions with object property indicator 2012 a. In response to detecting the additional user interaction, the scene-based image editing system 106 provides an alternative properties menu 2014 for display within the graphical user interface 2002. In one or more embodiments, the alternative properties menu 2014 includes one or more options for changing the properties of the corresponding object. In effect, as shown in FIG. 20B, alternative properties menu 2014 includes alternative properties indicators 2016a-2016c that provide object properties that can be used in place of the current object properties of object 2008.
As shown in fig. 20C, the scene-based image editing system 106 detects a user interaction with alternative property indicator 2016b. Thus, the scene-based image editing system 106 modifies digital image 2006 by modifying the object properties of object 2008 according to the user interaction with alternative property indicator 2016b. In particular, scene-based image editing system 106 modifies object 2008 to reflect the alternative object property associated with alternative property indicator 2016b.
In one or more embodiments, scene-based image editing system 106 utilizes a textual representation of alternative object properties when modifying object 2008. For example, as described above, scene-based image editing system 106 provides a text representation as text input to an attribute-modifying neural network and utilizes the attribute-modifying neural network to output a modified digital image, where object 2008 reflects a target change in its object attributes.
FIGS. 21A-21C illustrate another graphical user interface implemented by the scene-based image editing system 106 to facilitate modifying object properties of objects depicted in a digital image in accordance with one or more embodiments. As shown in fig. 21A, the scene-based image editing system 106 provides a digital image 2106 depicting an object 2108 for display within a graphical user interface 2102 of a client device 2104. In addition, upon detecting a user interaction with object 2108, scene-based image editing system 106 provides an attribute menu 2110 with object attribute indicators 2112a-2112c listing the object attributes of object 2108.
As shown in fig. 21B, scene-based image editing system 106 detects additional user interactions with object property indicator 2112B. In response to detecting the additional user interaction, scene-based image editing system 106 provides a slider bar 2114 for display within graphical user interface 2102. In one or more embodiments, the slider bar 2114 includes a slider bar element 2116 that indicates the extent to which the corresponding object attribute appears in the digital image 2106 (e.g., the intensity of its presence in the digital image 2106).
As shown in fig. 21C, the scene-based image editing system 106 detects a user interaction with the slider element 2116 of the slider 2114, thereby increasing the extent to which the corresponding object attribute appears in the digital image. Thus, the scene-based image editing system 106 modifies the digital image 2106 by modifying the object 2108 to reflect the increased intensity in the appearance of the corresponding object attribute.
In particular, in one or more embodiments, scene-based image editing system 106 utilizes attributes to modify the neural network to modify digital image 2106 in accordance with user interactions. In fact, as described above with reference to fig. 18, the scene-based image editing system 106 is able to modify the intensity of the appearance of the object properties via the coefficient α (strength or weakness). Thus, in one or more embodiments, the scene-based image editing system 106 adjusts the coefficient α based on the positioning of the slider element 2116 via user interaction.
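As a simple illustration, the slider position might be mapped to the coefficient α as follows; the value range and the linear mapping are assumptions made for this sketch.

```python
def slider_to_alpha(position, min_alpha=0.0, max_alpha=1.0):
    """Map a slider position in [0, 100] to a manipulation strength alpha."""
    position = max(0, min(100, position))
    return min_alpha + (position / 100.0) * (max_alpha - min_alpha)

# Moving the slider element to position 75 yields alpha = 0.75.
```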
By facilitating image modification targeting specific object properties as described above, the scene-based image editing system 106 provides improved flexibility and efficiency over conventional systems. In effect, the scene-based image editing system 106 provides a flexible, intuitive method of visually displaying descriptions of properties of objects and allowing user input interacting with those descriptions to change properties. Thus, the scene-based image editing system 106 allows user interaction to target object properties at a high level of abstraction (e.g., without having to interact at the pixel level), rather than requiring cumbersome manual manipulation of object properties as is typical in many conventional systems. Furthermore, because scene-based image editing system 106 enables modification of object properties via relatively few user interactions with provided visual elements, scene-based image editing system 106 enables a graphical user interface that provides improved efficiency.
As previously described, in one or more embodiments, scene-based image editing system 106 also uses a semantic scene graph generated for digital images to implement relationship-aware object modification. In particular, the scene-based image editing system 106 utilizes the semantic scene graph to inform the modifying behavior of objects depicted in the digital image based on their relationship to one or more other objects in the digital image. FIGS. 22A-25D illustrate implementing a relationship-aware object modification in accordance with one or more embodiments.
In fact, many conventional systems are inflexible in that they require different objects to interact individually for modification. This is often the case even if different objects are to be similarly modified (e.g., similarly resized or moved). For example, conventional systems typically require execution of a separate workflow via user interaction to modify a separate object, or at least to perform a preparatory step of modification (e.g., outlining the object and/or separating the object from the rest of the image). Furthermore, conventional systems are often unable to adapt to relationships between objects in digital images when modifications are performed. In practice, these systems may modify a first object within a digital image, but cannot perform modifications to a second object based on the relationship between the two objects. Thus, the resulting modified image may appear unnatural or aesthetically confusing because it does not correctly reflect the relationship between the two objects.
Thus, conventional systems are also typically inefficient in that they require a significant amount of user interaction to modify the individual objects depicted in the digital image. Indeed, as noted above, conventional systems typically require execution of a separate workflow via user interaction to perform many of the steps required to modify a separate object. Thus, many user interactions are redundant in that they are received, processed, and responded to multiple times for individual objects. Furthermore, when an object having a relationship with another object is modified, conventional systems require additional user interaction to modify the other object according to the relationship. Thus, these systems unnecessarily repeat interactions used (e.g., interactions for moving objects and then moving related objects) to perform individual modifications to related objects, even if the relationship implies modifications to be performed.
The scene-based image editing system 106 provides greater flexibility and efficiency than conventional systems by enabling relational awareness object modification. Indeed, as will be discussed, the scene-based image editing system 106 provides a flexible, simplified process for selecting related objects for modification. Thus, the scene-based image editing system 106 flexibly allows user interaction to select and modify multiple objects depicted in a digital image via a single workflow. In addition, scene-based image editing system 106 facilitates intuitive modifications to related objects such that the resulting modified image continues to reflect the relationship. Thus, the digital image modified by the scene-based image editing system 106 provides a more natural look than conventional systems.
Furthermore, the scene-based image editing system 106 improves efficiency by implementing a simplified process for selecting and modifying related objects. In particular, scene-based image editing system 106 implements a graphical user interface that reduces the user interaction required to select and modify multiple related objects. Indeed, as will be discussed, the scene-based image editing system 106 handles a relatively small amount of user interaction with one object to predict, suggest, and/or perform modifications to other objects, thereby eliminating the need for additional user interaction for such modifications.
For example, FIGS. 22A-22D illustrate a graphical user interface implemented by the scene-based image editing system 106 to facilitate relationship-aware object modification in accordance with one or more embodiments. In effect, as shown in FIG. 22A, the scene-based image editing system 106 provides a digital image 2206 depicting objects 2208a-2208b and object 2220 for display within the graphical user interface 2202 of the client device 2204. In particular, digital image 2206 depicts a relationship between objects 2208a-2208b, wherein object 2208a is holding object 2208b.
In one or more embodiments, scene-based image editing system 106 references a semantic scene graph previously generated for digital image 2206 to identify relationships between objects 2208a-2208 b. Indeed, as previously described, in some cases, scene-based image editing system 106 includes relationships between objects of digital images in a semantic scene graph generated for the digital images. For example, in one or more embodiments, scene-based image editing system 106 determines relationships between objects using a machine learning model, such as one of the models discussed above with reference to fig. 15 (e.g., clustering and subgraph suggestion generation models). Thus, the scene based image editing system 106 includes the determined relationships within the representation of the digital image in the semantic scene graph. Further, prior to receiving a user interaction to modify any of the objects 2208a-2208b, the scene-based image editing system 106 determines a relationship between the objects 2208a-2208b for inclusion in the semantic scene graph.
In effect, FIG. 22A illustrates a semantic scene graph component 2210 from the semantic scene graph of digital image 2206. In particular, semantic scene graph component 2210 includes node 2212a representing object 2208a and node 2212b representing object 2208 b. Further, semantic scene graph component 2210 includes relationship indicators 2214a-2214b associated with nodes 2212a-2212 b. Relationship indicators 2214a-2214b indicate relationships between objects 2208a-2208b, where object 2208a holds object 2208b, and object 2208b is in turn held by object 2208 a.
As further shown, semantic scene graph component 2210 includes behavior indicators 2216a-2216b associated with relationship indicator 2214 b. The behavior indicators 2216a-2216b assign behaviors to the object 2208b based on the relationship of the object 2208b to the object 2208a. For example, behavior indicator 2216a indicates that object 2208b moved with object 2208a because object 2208b was held by object 2208a. In other words, when object 2208a is moved, behavior indicator 2216a instructs scene-based image editing system 106 to move object 2208b (or at least suggest that object 2208b is moved). In one or more embodiments, scene-based image editing system 106 includes behavior indicators 2216a-2216b within a semantic scene graph based on behavior policy graphs used in generating the semantic scene graph. Indeed, in some cases, the behavior assigned to a "holding" relationship (or other relationship) varies based on the behavior policy map used. Thus, in one or more embodiments, scene-based image editing system 106 references previously generated semantic scene graphs to identify relationships between objects and behaviors assigned based on these relationships.
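The following sketch illustrates one possible shape of such a semantic scene graph component, with paired relationship indicators and behaviors assigned to the held object by a behavior policy graph. The field names and object labels are hypothetical assumptions for illustration only.

```python
scene_graph_component = {
    "nodes": {
        "object_2208a": {"label": "holder"},     # hypothetical labels
        "object_2208b": {"label": "held item"},
    },
    "relationships": [
        {"subject": "object_2208a", "predicate": "holding", "object": "object_2208b",
         "behaviors": []},
        {"subject": "object_2208b", "predicate": "held by", "object": "object_2208a",
         "behaviors": ["moves with", "deletes with"]},  # assigned via the behavior policy
    ],
}
```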
It should be noted that semantic scene graph component 2210 indicates that the behaviors of behavior indicators 2216a-2216b are assigned to object 2208b rather than object 2208a. Indeed, in one or more embodiments, scene-based image editing system 106 assigns behaviors to objects based on their roles in the relationship. For example, while it may be appropriate to move the held object when the holding object is moved, in some embodiments, the scene-based image editing system 106 determines that the holding object does not have to be moved when the held object is moved. Thus, in some implementations, the scene-based image editing system 106 assigns different behaviors to different objects in the same relationship.
As shown in fig. 22B, scene-based image editing system 106 determines a user interaction to select object 2208 a. For example, scene-based image editing system 106 determines that the user interaction targets object 2208a for modification. As further shown, scene-based image editing system 106 provides visual indication 2218 for display to indicate selection of object 2208 a.
As shown in fig. 22C, in response to detecting a user interaction selecting object 2208a, scene-based image editing system 106 automatically selects object 2208b. For example, in one or more embodiments, upon detecting a user interaction selecting object 2208a, scene-based image editing system 106 references a semantic scene graph generated for digital image 2206 (e.g., semantic scene graph component 2210 corresponding to object 2208 a). Based on the information represented in the semantic scene graph, scene-based image editing system 106 determines that another object exists in digital image 2206 that has a relationship with object 2208 a. In effect, scene-based image editing system 106 determines that object 2208a holds object 2208b. Instead, scene-based image editing system 106 determines that object 2208b is held by object 2208 a.
Because objects 2208a-2208b have a relationship, scene-based image editing system 106 adds object 2208b to the selection. As shown in fig. 22C, scene-based image editing system 106 modifies visual indication 2218 of the selection to indicate that object 2208b has been added to the selection. Although FIG. 22C illustrates automatic selection of object 2208b, in some cases, scene-based image editing system 106 selects object 2208b based on the behavior assigned to object 2208b within the semantic scene graph according to the relationship of object 2208b to object 2208 a. Indeed, in some cases, the scene-based image editing system 106 specifies when a relationship between objects results in automatically selecting one object when the user selects another object (e.g., via a "select together" behavior). However, as shown in fig. 22C, in some cases, the scene-based image editing system 106 automatically selects the object 2208b by default.
In one or more embodiments, scene-based image editing system 106 surfaces object masks for object 2208a and object 2208b based on object 2208a and object 2208b being included in the selection. In effect, scene-based image editing system 106 surfaces pre-generated object masks for objects 2208a-2208b in anticipation of modifications to objects 2208a-2208b. In some cases, scene-based image editing system 106 retrieves the pre-generated object masks from the semantic scene graph of digital image 2206 or from a storage location for the pre-generated object masks. In either case, the object masks are readily available once objects 2208a-2208b are included in the selection and before a modification input has been received.
As further shown in fig. 22C, scene-based image editing system 106 provides an options menu 2222 for display within graphical user interface 2202. In one or more embodiments, scene-based image editing system 106 determines that at least one of the modification options from options menu 2222 will apply to both objects 2208a-2208b if selected. In particular, scene-based image editing system 106 determines that a modification selected for object 2208a will also apply to object 2208b based on the behavior assigned to object 2208b.
Indeed, in one or more embodiments, in addition to determining the relationship between objects 2208a-2208b, scene-based image editing system 106 references the semantic scene graph for digital image 2206 to determine the behavior that has been assigned based on the relationship. In particular, scene-based image editing system 106 references behavior indicators (e.g., behavior indicators 2216a-2216 b) associated with relationships between objects 2208a-2208b to determine which behaviors are assigned to objects 2208a-2208b based on their relationships. Thus, by determining the behavior assigned to object 2208b, scene-based image editing system 106 determines how to respond to the potential edits.
For example, as shown in FIG. 22D, scene-based image editing system 106 deletes objects 2208a-2208b together. For example, in some cases, in response to detecting a selection of option 2224 presented within option menu 2222, scene-based image editing system 106 deletes objects 2208a-2208b. Thus, while object 2208a is targeted for deletion via user interaction, scene-based image editing system 106 includes object 2208b in a delete operation based on the behavior assigned to object 2208b via the semantic scene graph (i.e., the "delete together" behavior). Thus, in some embodiments, the scene-based image editing system 106 implements relationship-aware object modification by deleting objects based on their relationship to other objects.
As previously described, in some implementations, the scene-based image editing system 106 adds an object to the selection only if its assigned behavior specifies that the object should be selected along with another object. In at least some cases, the scene-based image editing system 106 adds the object before receiving any modification input only if its assigned behavior specifies that the object should be selected with the other object. Indeed, in some cases, only a subset of the potential edits to the first object are applicable to the second object based on the behavior assigned to the second object. Thus, if there is no behavior providing automatic selection, then including the second object in the selection of the first object prior to receiving the modification input risks violating the rules set forth by the behavior policy graph via the semantic scene graph. To avoid this risk, in some implementations, the scene-based image editing system 106 waits until a modification input has been received before determining whether to add the second object to the selection. However, in one or more embodiments, as shown in FIGS. 22A-22D, the scene-based image editing system 106 automatically adds the second object upon detecting a selection of the first object. In such embodiments, the scene-based image editing system 106 deselects the second object when it determines that the modification to the first object is not applicable to the second object based on the behavior assigned to the second object.
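A minimal sketch of this relationship-aware behavior check is shown below, using the hypothetical scene graph component sketched earlier: given the object the user selected and the pending edit, it returns the related objects whose assigned behaviors indicate they should be modified together. The mapping of edits to behaviors is an assumption for illustration.

```python
BEHAVIOR_FOR_EDIT = {"move": "moves with", "delete": "deletes with"}

def related_objects_for_edit(graph, selected_id, edit_type):
    """Objects whose assigned behaviors tie them to `selected_id` for this edit."""
    required_behavior = BEHAVIOR_FOR_EDIT[edit_type]
    return [
        rel["subject"]
        for rel in graph["relationships"]
        if rel["object"] == selected_id and required_behavior in rel.get("behaviors", [])
    ]

# Deleting object_2208a also pulls in object_2208b via its "deletes with" behavior:
# related_objects_for_edit(scene_graph_component, "object_2208a", "delete")
# -> ["object_2208b"]
```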
As further shown in fig. 22D, object 2220 remains in digital image 2206. In effect, scene-based image editing system 106 does not add object 2220 to the selection in response to user interaction with object 2208a, nor does object 2220 delete with objects 2208a-2208 b. For example, upon referencing the semantic scene graph of digital image 2206, scene-based image editing system 106 determines that there is no relationship (at least, no relationship applied to the scene) between object 2220 and any of objects 2208a-2208 b. Thus, the scene-based image editing system 106 enables user interactions to modify related objects together while preventing non-related objects from being modified without more targeted user interactions.
Further, as shown in FIG. 22D, when objects 2208a-2208b are removed, scene-based image editing system 106 exposes content fill 2226 within digital image 2206. In particular, upon deleting an object 2208a-2208b, the scene-based image editing system 106 exposes content fills previously generated for the object 2208a and content fills previously generated for the object 2208 b. Thus, the scene-based image editing system 106 facilitates seamless modification of the digital image 2206 as if it were a real scene.
FIGS. 23A-23C illustrate another graphical user interface implemented by the scene-based image editing system 106 to facilitate relationship-aware object modification in accordance with one or more embodiments. In effect, as shown in FIG. 23A, the scene-based image editing system 106 provides a digital image 2306 depicting objects 2308a-2308b and object 2320 for display within a graphical user interface 2302 of a client device 2304. In particular, the digital image 2306 depicts a relationship between objects 2308a-2308b, wherein object 2308a holds object 2308b.
As further shown in fig. 23A, scene-based image editing system 106 detects a user interaction selecting object 2308a. In response to detecting the user interaction, scene-based image editing system 106 provides a suggestion to add object 2308b to the selection. In particular, scene-based image editing system 106 provides, for display, a text box 2310 asking the user whether he or she wishes to add object 2308b to the selection, an option 2312 for agreeing to add object 2308b, and an option 2314 for declining to add object 2308b.
In one or more embodiments, the scene-based image editing system 106 provides suggestions for adding objects 2308b to a selection based on determining relationships between objects 2308a-2308b via a semantic scene graph generated for the digital image 2306. In some cases, scene-based image editing system 106 also provides suggestions for adding object 2308b according to the behavior assigned to object 2308b based on the relationship.
As shown in fig. 23A, scene-based image editing system 106 does not suggest adding object 2320 to the selection. Indeed, in one or more embodiments, based on referencing the semantic scene graph, scene-based image editing system 106 determines that there is no relationship (at least, no relevant relationship) between object 2320 and either of objects 2308a-2308b. Thus, the scene-based image editing system 106 determines to omit object 2320 from the suggestion.
As shown in fig. 23B, scene-based image editing system 106 adds object 2308B to the selection. In particular, in response to receiving a user interaction with option 2312 to agree to add object 2308b, scene-based image editing system 106 adds object 2308b to the selection. As shown in fig. 23B, scene-based image editing system 106 modifies visual indication 2316 of the selection to indicate that object 2308B is added to the selection along with object 2308 a.
As shown in fig. 23C, in response to detecting one or more additional user interactions, scene-based image editing system 106 modifies digital image 2306 by moving object 2308a within digital image 2306. Further, scene-based image editing system 106 moves object 2308b along with object 2308a based on including object 2308b in the selection. Thus, the scene-based image editing system 106 implements relationship-aware object modification by moving objects based on their relationships to other objects.
FIGS. 24A-24C illustrate yet another graphical user interface implemented by the scene-based image editing system 106 to facilitate relationship-aware object modification in accordance with one or more embodiments. In effect, as shown in FIG. 24A, scene-based image editing system 106 provides digital image 2406 depicting objects 2408a-2408b and object 2420 for display within graphical user interface 2402 of client device 2404. Specifically, digital image 2406 depicts a relationship between objects 2408a-2408b in that object 2408a holds object 2408b.
As shown in fig. 24A, scene-based image editing system 106 detects a user interaction with object 2408 a. In response to detecting the user interaction, the scene-based image editing system 106 provides an options menu 2410 for display within the graphical user interface 2402. As shown, the options menu 2410 includes an option 2412 for deleting the object 2408 a.
As shown in fig. 24B, scene-based image editing system 106 detects an additional user interaction with option 2412 for deleting object 2408a. In response to detecting the additional user interaction, scene-based image editing system 106 provides, for display, a suggestion for adding object 2408b to the selection via a text box 2414 asking the user whether the user wishes to add object 2408b to the selection, an option 2416 for agreeing to add object 2408b, and an option 2418 for declining to add object 2408b.
Indeed, as described above, in one or more embodiments, the scene-based image editing system 106 waits to receive input to modify the first object before suggesting the addition of the second object (or automatically adding the second object). Thus, the scene-based image editing system 106 determines whether the relationship between the object and the pending modification indicates that the second object should be added before including the second object in the selection.
To illustrate, in one or more embodiments, upon detecting the additional user interaction with option 2412, scene-based image editing system 106 references the semantic scene graph of digital image 2406. Upon referencing the semantic scene graph, scene-based image editing system 106 determines that object 2408a has a relationship with object 2408b. Further, the scene-based image editing system 106 determines that the behavior assigned to object 2408b based on the relationship indicates that object 2408b should be deleted along with object 2408a. Thus, upon receiving the additional user interaction for deleting object 2408a, scene-based image editing system 106 determines that object 2408b should also be deleted, and then provides a suggestion to add object 2408b (or automatically adds object 2408b) to the selection.
As shown in fig. 24C, scene-based image editing system 106 deletes object 2408a and object 2408b together from digital image 2406. Specifically, in response to detecting a user interaction with the option 2416 for adding the object 2408b to the selection, the scene-based image editing system 106 adds the object 2408b and performs a delete operation. In one or more embodiments, upon detecting a user interaction with the option 2418 to reject the add object 2408b, the scene-based image editing system 106 omits the object 2408b from the selection and deletes only the object 2408a.
Although moving objects or deleting objects based on their relationships to other objects is specifically discussed above, it should be noted that in various embodiments, scene-based image editing system 106 implements various other types of relationship-aware object modifications. For example, in some cases, the scene-based image editing system 106 implements relationship-aware object modifications via resizing, recoloring, texture, or composition modifications. Further, as previously described, in some embodiments, the behavior policies used by the scene-based image editing system 106 are configurable. Thus, in some implementations, the relationship-aware object modifications implemented by the scene-based image editing system 106 change based on user preferences.
In one or more embodiments, in addition to modifying objects based on relationships described within the behavior policy graphs incorporated into the semantic scene graph, scene-based image editing system 106 modifies objects based on classification relationships. In particular, in some embodiments, the scene-based image editing system 106 modifies objects based on relationships described by the real-world category description map incorporated into the semantic scene graph. In fact, as previously described, the real world category description map provides a hierarchy of object classifications for objects that may be depicted in a digital image. Thus, in some implementations, the scene-based image editing system 106 modifies objects within the digital image via their respective object classification hierarchies based on their relationships to other objects. For example, in one or more embodiments, scene-based image editing system 106 adds objects to the selection for modification via its respective object classification hierarchy based on the relationships of the objects to other objects. FIGS. 25A-25D illustrate a graphical user interface implemented by the scene-based image editing system 106 to add objects to a selection for modification based on a classification relationship in accordance with one or more embodiments.
In particular, FIG. 25A illustrates that the scene-based image editing system 106 provides a digital image 2506 depicting a plurality of objects 2508a-2508g for display in a graphical user interface 2502 of a client device 2504. Specifically, as shown, the objects 2508a-2508g include various items, such as shoes, eyeglasses, and jackets.
FIG. 25A also shows semantic scene graph components 2510a-2510c from the semantic scene graph of digital image 2506. In effect, the semantic scene graph components 2510a-2510c include portions of a semantic scene graph that provide an object classification hierarchy for each of the objects 2508a-2508 g. Alternatively, in some cases, semantic scene graph components 2510a-2510c represent portions of a real world category description graph used to make a semantic scene graph.
As shown in fig. 25A, semantic scene graph component 2510a includes a node 2512 representing a clothing category, a node 2514 representing an accessory category, and a node 2516 representing a shoe category. As further shown, the accessory category is a subcategory of the apparel category, and the footwear category is a subcategory of the accessory category. Similarly, the semantic scene graph component 2510b includes a node 2518 representing a clothing category, a node 2520 representing an accessory category, and a node 2522 representing a glasses category as a sub-category of the accessory category. Further, the semantic scene graph component 2510c includes a node 2524 representing a clothing category and a node 2526 representing an overcoat category as another sub-category of the clothing category. Thus, semantic scene graph components 2510a-2510c provide various classifications that apply to each of the objects 2508a-2508 g. In particular, semantic scene graph component 2510a provides a hierarchy of object classifications associated with shoes presented in digital image 2506, semantic scene graph component 2510b provides a hierarchy of object classifications associated with glasses, and semantic scene graph component 2510c provides a hierarchy of object classifications associated with jackets.
As shown in fig. 25B, scene-based image editing system 106 detects a user interaction selecting object 2508e. In addition, scene-based image editing system 106 detects a user interaction selecting object 2508b. As further shown, in response to detecting the selection of object 2508b and object 2508e, scene-based image editing system 106 provides a text box 2528 that suggests adding all shoes in digital image 2506 to the selection.
To illustrate, in some embodiments, in response to detecting selection of object 2508b and object 2508e, scene-based image editing system 106 references a semantic scene graph (e.g., semantic scene graph components associated with object 2508b and object 2508 e) generated for digital image 2506. Based on the reference semantic scene graph, scene-based image editing system 106 determines that both object 2508b and object 2508e are part of a footwear category. Thus, scene-based image editing system 106 determines, via the footwear category, that a classification relationship exists between object 2508b and object 2508 e. In one or more embodiments, based on determining that both object 2508b and object 2508e are part of a footwear category, scene-based image editing system 106 determines to provide a user interaction that selects to target all of the shoes within digital image 2506. Accordingly, scene-based image editing system 106 provides a text box 2528, which text box 2528 suggests adding other shoes to the selection. In one or more embodiments, upon receiving a user interaction accepting the suggestion, the scene-based image editing system 106 adds other shoes to the selection.
Similarly, as shown in FIG. 25C, scene-based image editing system 106 detects a user interaction selecting object 2508C and another user interaction selecting object 2508 b. In response to detecting the user interaction, scene-based image editing system 106 references the semantic scene graph generated for digital image 2506. Based on the reference semantic scene graph, the scene-based image editing system 106 determines that the object 2508b is part of a footwear category that is a sub-category of the accessory category. In other words, the scene-based image editing system 106 determines that the object 2508b is part of a accessory category. Likewise, the scene-based image editing system 106 determines that the object 2508c is part of a glasses category that is a sub-category of an accessory category. Thus, the scene-based image editing system 106 determines that a classification relationship exists between the object 2508b and the object 2508c via the accessory category. As shown in fig. 25C, based on determining that both object 2508b and object 2508C are part of an accessory category, scene-based image editing system 106 provides a text box 2530 suggesting that all other accessories depicted in digital image 2506 (e.g., other shoes and glasses) be added to the selection.
In addition, as shown in FIG. 25D, scene-based image editing system 106 detects a user interaction that selects object 2508a and another user interaction that selects object 2508b. In response to detecting the user interactions, scene-based image editing system 106 references the semantic scene graph generated for digital image 2506. Based on referencing the semantic scene graph, the scene-based image editing system 106 determines that object 2508b is part of a footwear category, which is a sub-category of an accessory category, which is in turn a sub-category of a clothing category. Similarly, the scene-based image editing system 106 determines that object 2508a is part of an overcoat category, which is also a sub-category of the clothing category. Thus, the scene-based image editing system 106 determines that a classification relationship exists between object 2508b and object 2508a via the clothing category. As shown in fig. 25D, based on determining that both object 2508b and object 2508a are part of the clothing category, scene-based image editing system 106 provides a text box 2532 that suggests adding all other clothing items depicted in digital image 2506 to the selection.
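The classification-relationship checks in these examples can be sketched as follows. The object classification hierarchies and helper functions below are illustrative assumptions: the sketch finds the most specific category shared by the user's selections and suggests every other object under that category.

```python
# Hypothetical object classification hierarchies (root -> leaf).
OBJECT_HIERARCHY = {
    "object_2508a": ["clothing", "coat"],
    "object_2508b": ["clothing", "accessory", "shoe"],
    "object_2508c": ["clothing", "accessory", "glasses"],
    "object_2508e": ["clothing", "accessory", "shoe"],
    "object_2508f": ["clothing", "accessory", "shoe"],   # assumed additional shoe
}

def common_category(object_ids):
    """Deepest classification shared by all selected objects, or None."""
    paths = [OBJECT_HIERARCHY[o] for o in object_ids]
    shared = None
    for level in zip(*paths):
        if len(set(level)) == 1:
            shared = level[0]
        else:
            break
    return shared

def suggest_additions(selected):
    category = common_category(selected)
    if category is None:
        return []
    return [o for o, path in OBJECT_HIERARCHY.items()
            if category in path and o not in selected]

# Two shoes selected -> suggest the other shoes; a shoe plus glasses -> suggest
# all other accessories; a shoe plus a coat -> suggest all other clothing items.
print(suggest_additions(["object_2508b", "object_2508e"]))  # ['object_2508f']
print(suggest_additions(["object_2508b", "object_2508c"]))  # ['object_2508e', 'object_2508f']
print(suggest_additions(["object_2508a", "object_2508b"]))  # ['object_2508c', 'object_2508e', 'object_2508f']
```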
Thus, in one or more embodiments, scene-based image editing system 106 predicts objects that are targeted user interactions and facilitates faster selection of those objects based on their classification relationships. In some embodiments, upon selection of multiple objects via the provided suggestions, the scene-based image editing system 106 modifies the selected objects in response to additional user interactions. In effect, scene-based image editing system 106 modifies the selected objects together. Thus, the scene-based image editing system 106 implements a graphical user interface that provides a more flexible and efficient method of selecting and modifying multiple related objects using reduced user interactions.
Indeed, as previously described, the scene-based image editing system 106 provides improved flexibility and efficiency over conventional systems. For example, by selecting objects based on selection of related objects (e.g., automatically or via suggestions), the scene-based image editing system 106 provides a flexible method of targeting multiple objects for modification. In effect, the scene-based image editing system 106 flexibly identifies related objects and includes them in the selection. Thus, the scene-based image editing system 106 implements a graphical user interface that reduces user interaction typically required in conventional systems for selecting and modifying multiple objects.
In one or more embodiments, the scene-based image editing system 106 also pre-processes a digital image to facilitate removal of interfering objects. In particular, the scene-based image editing system 106 utilizes machine learning to identify objects in a digital image, classify one or more of the objects as interfering objects, and facilitate removal of the interfering objects to provide a more visually coherent and aesthetically pleasing resulting image. Furthermore, in some cases, the scene-based image editing system 106 utilizes machine learning to facilitate removing shadows associated with the interfering objects. FIGS. 26-39C illustrate diagrams of the scene-based image editing system 106 identifying and removing interfering objects and their shadows from digital images in accordance with one or more embodiments.
Many conventional systems lack flexibility in their methods for removing interfering humans because they take control away from the user. For example, conventional systems typically remove automatically the people they classify as interfering. Thus, upon receiving a digital image, such systems do not provide an opportunity for user interactions to provide input regarding the removal process. For example, these systems do not allow user interactions to remove a person from the set of people identified for removal.
Furthermore, conventional systems often do not flexibly remove all types of interfering objects. For example, many conventional systems are not flexible in removing shadows cast by interfering objects and non-human objects. Indeed, while some existing systems identify and remove interfering people in digital images, these systems often fail to identify shadows cast by people or other objects in digital images. Thus, the resulting digital image will still include the effects of the interfering human being, as its shadows remain despite the removal of the interfering human being itself. This further results in these conventional systems requiring additional user interaction to identify and remove these shadows.
Scene-based image editing system 106 solves these problems by providing more user control during the removal process while reducing the interactions typically required to delete objects from a digital image. Indeed, as will be explained below, the scene-based image editing system 106 presents, for display, the identified interfering objects as a set of objects selected for removal. The scene-based image editing system 106 also enables user interactions to add objects to the set, remove objects from the set, and/or determine when to delete the selected objects. Thus, the scene-based image editing system 106 employs a flexible workflow to remove interfering objects based on machine learning and user interactions.
Furthermore, the scene-based image editing system 106 flexibly identifies and removes shadows associated with interfering objects within the digital image. By removing shadows associated with the interfering objects, the scene-based image editing system 106 provides better image results because additional aspects of the interfering objects and their effects within the digital image are removed. This allows for reduced user interaction compared to conventional systems, as the scene-based image editing system 106 does not require additional user interaction to identify and remove shadows.
FIG. 26 illustrates a neural network pipeline used by the scene-based image editing system 106 to identify and remove interfering objects from digital images in accordance with one or more embodiments. In effect, as shown in FIG. 26, scene-based image editing system 106 receives digital image 2602 depicting a plurality of objects. As shown, the scene-based image editing system 106 provides digital images 2602 to a neural network pipeline that includes a segmentation neural network 2604, an interferent detection neural network 2606, a shadow detection neural network 2608, and a repair neural network 2610.
In one or more embodiments, the scene-based image editing system 106 utilizes one of the segmented neural networks described above (e.g., the detection masking neural network 300 discussed with reference to fig. 3) as the segmented neural network 2604. In some embodiments, scene-based image editing system 106 utilizes one of the content-aware machine learning models discussed above as repair neural network 2610 (e.g., cascade modulation repair neural network 420 discussed with reference to fig. 4). The interferent detection neural network 2606 and the shadow detection neural network 2608 will be discussed in more detail below.
As shown in fig. 26, scene-based image editing system 106 utilizes the neural network pipeline to generate a modified digital image 2612 from the digital image 2602. In particular, the scene-based image editing system 106 utilizes the pipeline of neural networks to identify and remove interfering objects from the digital image 2602. Specifically, the scene-based image editing system 106 generates object masks for the objects in the digital image using the segmentation neural network 2604. The scene-based image editing system 106 utilizes the interferent detection neural network 2606 to determine a classification for each object of the plurality of objects. More specifically, the scene-based image editing system 106 assigns each object a classification of subject object or interfering object. The scene-based image editing system 106 removes the interfering objects from the digital image using the object masks. In addition, the scene-based image editing system 106 utilizes the repair neural network 2610 to generate content fills for the portions of the digital image 2602 from which the interfering objects were removed, thereby generating the modified digital image 2612. As shown, the scene-based image editing system 106 deletes a variety of different types of interfering objects (the multiple men and poles). In practice, the scene-based image editing system 106 is robust enough to identify non-human objects as interferences (e.g., the poles behind the girl).
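As a rough illustration of the pipeline's flow (object removal only, without the shadow detection stage), the sketch below segments objects, classifies each as a subject or interfering object, removes the interfering ones, and inpaints the exposed regions. The call signatures of the three networks are assumptions standing in for the segmentation, interferent detection, and repair neural networks.

```python
import numpy as np

def remove_interfering_objects(image, segmentation_net, distractor_net, inpainting_net):
    """image: (H, W, 3) array; the *_net arguments are callables standing in
    for the segmentation, interferent detection, and repair neural networks."""
    object_masks = segmentation_net(image)            # list of boolean (H, W) masks
    labels = distractor_net(image, object_masks)      # "subject" or "distractor" per mask
    removal_mask = np.zeros(image.shape[:2], dtype=bool)
    for mask, label in zip(object_masks, labels):
        if label == "distractor":
            removal_mask |= mask                      # collect every interfering object
    holed = image.copy()
    holed[removal_mask] = 0                           # delete interfering pixels
    return inpainting_net(holed, removal_mask)        # content fill for exposed regions
```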
In one or more embodiments, the scene-based image editing system 106 utilizes a subset of the neural networks shown in fig. 26 to generate a modified digital image. For example, in some cases, scene-based image editing system 106 utilizes the segmentation neural network 2604, the interferent detection neural network 2606, and content fill 210 to generate a modified digital image from the digital image. Furthermore, in some cases, scene-based image editing system 106 utilizes a different order of the neural networks than shown.
Fig. 27 illustrates an architecture of an interferent detection neural network 2700 that is used by the scene-based image editing system 106 to identify and classify interfering objects in digital images in accordance with one or more embodiments. As shown in fig. 27, the interferent detection neural network 2700 includes a heat map network 2702 and an interferent classifier 2704.
As shown, a heat map network 2702 operates on an input image 2706 to generate heat map 2708. For example, in some cases, heat map network 2702 generates a subject heat map representing potential subject objects and an interferent heat map representing potential interfering objects. In one or more embodiments, a heat map (also referred to as a class activation map) includes a prediction made by a convolutional neural network that indicates, on a scale of 0 to 1, the probability that a particular pixel of an image belongs to a particular class from a set of classes. In contrast to object detection, the goal of a heat map network is, in some cases, to classify individual pixels as being part of the same region. In some cases, a region includes an area in which all pixels of the digital image have the same color or brightness.
In at least one implementation, the scene-based image editing system 106 trains the heat map network 2702 on whole images, including digital images without interfering objects as well as digital images depicting both subject objects and interfering objects.
In one or more embodiments, the heat map network 2702 identifies features in the digital image, such as body pose and orientation, that help determine whether a given region is more likely to be an interfering object or more likely to be a subject object. For example, in some cases, the heat map network 2702 determines that an object in an unposed, candid stance (as opposed to a posed, standing stance) is likely to be an interfering object, and also determines that an object facing away from the camera is likely to be an interfering object. In some cases, the heat map network 2702 considers other features, such as size, intensity, color, and the like.
In some embodiments, the heat map network 2702 classifies regions of the input image 2706 as subject objects or interfering objects and outputs the heat maps 2708 based on the classification. For example, in some embodiments, the heat map network 2702 represents any pixels determined to be part of a subject object as white within the subject heat map and any pixels determined not to be part of a subject object as black (or vice versa). Similarly, in some cases, the heat map network 2702 represents any pixels determined to be part of an interfering object as white within the interferent heat map and any pixels determined not to be part of an interfering object as black (or vice versa).
In some implementations, the heat map network 2702 also generates a background heat map representing the likely background as part of the heat maps 2708. For example, in some cases, the heat map network 2702 determines that the background includes areas that are not part of a subject object or an interfering object. In some cases, the heat map network 2702 represents any pixels determined to be part of the background as white within the background heat map and any pixels determined not to be part of the background as black (or vice versa).
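By way of illustration only, the following hedged Python sketch shows one way per-pixel class scores could be converted into the subject, interferent, and background heat maps described above; the three-channel scores array and the argmax-based thresholding are assumptions made for the sketch, not the heat map network's actual post-processing.

```python
# Convert per-pixel class probabilities into three binary heat maps
# (white = 255 where a pixel is assigned to the class, black = 0 elsewhere).
import numpy as np

def scores_to_heat_maps(scores: np.ndarray):
    """scores: 3xHxW array of per-pixel class probabilities in [0, 1]
    (channel 0 = subject, 1 = interferent, 2 = background)."""
    winner = scores.argmax(axis=0)
    subject_map = (winner == 0).astype(np.uint8) * 255      # white where subject
    interferent_map = (winner == 1).astype(np.uint8) * 255  # white where interferent
    background_map = (winner == 2).astype(np.uint8) * 255   # white where background
    return subject_map, interferent_map, background_map
```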
In one or more embodiments, the interferent detection neural network 2700 utilizes the heat maps 2708 output by the heat map network 2702 as a heat-map prior for the interferent classifier 2704, indicating the probability that a particular region of the input image 2706 contains an interfering object or a subject object.
In one or more embodiments, the interferent detection neural network 2700 utilizes the interferent classifier 2704 to consider both the global information included in the heat maps 2708 and the local information included in the one or more individual objects 2710. To illustrate, in some embodiments, the interferent classifier 2704 generates a score for the classification of an object. If an object in the digital image appears to be a subject object based on the local information, but the heat maps 2708 indicate with high probability that the object is an interfering object, the interferent classifier 2704 concludes, in some cases, that the object is indeed an interfering object. On the other hand, if the heat maps 2708 also indicate that the object is a subject object, the interferent classifier 2704 determines that the object is confirmed as a subject object.
As shown in fig. 27, interferent classifier 2704 includes crop generator 2712 and hybrid classifier 2714. In one or more embodiments, interferent classifier 2704 receives one or more individual objects 2710 that have been identified from input image 2706. In some cases, one or more individual objects 2710 are identified via user annotations or some object detection network (e.g., object detection machine learning model 308 discussed above with reference to fig. 3).
As shown in fig. 27, the interferent classifier 2704 utilizes the crop generator 2712 to generate cropped images 2716 by cropping the input image 2706 based on the locations of the one or more individual objects 2710. For example, where there are three object detections in the input image 2706, the crop generator 2712 generates three cropped images—one cropped image for each detected object. In one or more embodiments, the crop generator 2712 generates a cropped image by removing all pixels of the input image 2706 outside the location of the corresponding inferred boundary region.
As further shown, the interferent classifier 2704 also utilizes the crop generator 2712 to generate cropped heat maps 2718 by cropping the heat maps 2708 with respect to each detected object. For example, in one or more embodiments, the crop generator 2712 generates cropped heat maps for each detected object based on the area within each of the subject heat map, the interferent heat map, and the background heat map that corresponds to the location of the detected object.
In one or more embodiments, for each individual object of the one or more individual objects 2710, the interferent classifier 2704 utilizes the hybrid classifier 2714 to operate on the corresponding cropped image (e.g., its features) and the corresponding cropped heat map (e.g., its features) to determine whether the object is a subject object or an interfering object. To illustrate, in some embodiments, for a detected object, the hybrid classifier 2714 operates on the cropped image associated with the detected object and the cropped heat map associated with the detected object (e.g., the cropped heat map derived from the heat maps 2708 based on the location of the detected object) to determine whether the detected object is a subject object or an interfering object. In one or more embodiments, the interferent classifier 2704 combines the features of the cropped image of the detected object with the features of the corresponding cropped heat map (e.g., via concatenation or feature addition) and provides the combination to the hybrid classifier 2714. As shown in fig. 27, the hybrid classifier 2714 generates, from the corresponding cropped image and cropped heat map, a binary decision 2720 that includes a label for the detected object as a subject object or an interfering object.
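By way of illustration only, the following Python sketch outlines the crop-and-classify flow described above. The bounding boxes and the hybrid_classifier callable are hypothetical inputs standing in for the earlier detection step and the hybrid classifier 2714.

```python
# Crop the image and the heat-map prior around each detected object, then let
# a (hypothetical) hybrid classifier label the object as subject or distractor.
import numpy as np

def classify_objects(image, heat_maps, boxes, hybrid_classifier):
    """image: HxWx3, heat_maps: HxWxC prior (subject/interferent/background),
    boxes: list of (y0, x0, y1, x1) tuples, one per detected object."""
    labels = []
    for (y0, x0, y1, x1) in boxes:
        cropped_image = image[y0:y1, x0:x1]          # local information
        cropped_heat_map = heat_maps[y0:y1, x0:x1]   # global information (prior)
        is_distractor = hybrid_classifier(cropped_image, cropped_heat_map)
        labels.append("distractor" if is_distractor else "subject")
    return labels
```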
Fig. 28 illustrates an architecture of a heat map network 2800 used by the scene-based image editing system 106 as part of an interferent detection neural network in accordance with one or more embodiments. As shown in fig. 28, the heat map network 2800 includes a convolutional neural network 2802 as its encoder. In one or more embodiments, the convolutional neural network 2802 includes a deep residual network. As further shown in fig. 28, the heat map network 2800 includes a heat map head 2804 as its decoder.
FIG. 29 illustrates an architecture of a hybrid classifier 2900 that is used as part of an interferent detection neural network by a scene-based image editing system 106 in accordance with one or more embodiments. As shown in fig. 29, hybrid classifier 2900 includes a convolutional neural network 2902. In one or more embodiments, hybrid classifier 2900 uses convolutional neural network 2902 as an encoder.
To illustrate, in one or more embodiments, the scene-based image editing system 106 provides the features of the cropped image 2904 to the convolutional neural network 2902. In addition, the scene-based image editing system 106 provides the features of the cropped heat map 2906 corresponding to the object of the cropped image 2904 to an inner layer 2910 of the hybrid classifier 2900. Specifically, as shown, in some cases, the scene-based image editing system 106 concatenates the features of the cropped heat map 2906 with the output of the previous inner layer (via the concatenation operation 2908) and provides the resulting feature map to the inner layer 2910 of the hybrid classifier 2900. In some embodiments, the feature map includes 2048+N channels, where N corresponds to the channels of the heat map network's output and 2048 corresponds to the channels of the previous inner layer's output (although 2048 is merely one example).
As shown in fig. 29, the hybrid classifier 2900 convolves the output of the inner layer 2910 to reduce the channel depth. In addition, the hybrid classifier 2900 performs another convolution on the output of a subsequent inner layer 2914 to further reduce the channel depth. In some cases, the hybrid classifier 2900 applies pooling to the output of the final inner layer 2916 preceding the binary classification head 2912. For example, in some cases, the hybrid classifier 2900 averages the values of the final inner layer's output to generate an average value. In some cases, where the average value is above a threshold, the hybrid classifier 2900 classifies the corresponding object as an interfering object and outputs a corresponding binary value; otherwise, the hybrid classifier 2900 classifies the corresponding object as a subject object and outputs the corresponding binary value (or vice versa). Thus, the hybrid classifier 2900 provides an output 2918 containing a label for the corresponding object.
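By way of illustration only, the following hedged PyTorch sketch mirrors the structure just described: encoder features of the cropped image are concatenated with the cropped heat-map prior, the channel depth is reduced by successive convolutions, and pooling feeds a binary classification head. The layer sizes and the stand-in encoder are assumptions rather than the actual architecture of fig. 29.

```python
# A simplified hybrid classifier: image features + heat-map prior -> binary label.
import torch
import torch.nn as nn

class HybridClassifier(nn.Module):
    def __init__(self, heat_map_channels: int = 3, feat_channels: int = 2048):
        super().__init__()
        # Stand-in image encoder producing a feat_channels-deep feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, feat_channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # After concatenating the heat-map prior, reduce the channel depth twice.
        self.reduce1 = nn.Conv2d(feat_channels + heat_map_channels, 512, 1)
        self.reduce2 = nn.Conv2d(512, 128, 1)
        self.head = nn.Linear(128, 1)  # binary classification head

    def forward(self, cropped_image, cropped_heat_map):
        feats = self.encoder(cropped_image)
        # Resize the heat-map prior to the encoder's spatial size and concatenate.
        prior = nn.functional.interpolate(
            cropped_heat_map, size=feats.shape[-2:], mode="bilinear",
            align_corners=False)
        x = torch.cat([feats, prior], dim=1)       # 2048 + N channels
        x = torch.relu(self.reduce1(x))
        x = torch.relu(self.reduce2(x))
        x = x.mean(dim=(-2, -1))                   # global average pooling
        return torch.sigmoid(self.head(x))         # > 0.5 => interfering object (by convention)
```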
FIGS. 30A-30C illustrate a graphical user interface implemented by the scene-based image editing system 106 to identify and remove interfering objects from digital images in accordance with one or more embodiments. For example, as shown in fig. 30A, the scene-based image editing system 106 provides a digital image 3006 for display within a graphical user interface 3002 of the client device 3004. As further shown, the digital image 3006 depicts an object 3008 and a plurality of additional objects 3010a-3010d.
In addition, as shown in FIG. 30A, the scene-based image editing system 106 provides a progress indicator 3012 for display within the graphical user interface 3002. In some cases, the scene-based image editing system 106 provides the progress indicator 3012 to indicate that the digital image 3006 is being analyzed for interfering objects. For example, in some embodiments, the scene-based image editing system 106 provides the progress indicator 3012 while utilizing the interferent detection neural network to identify and classify interfering objects within the digital image 3006. In one or more embodiments, the scene-based image editing system 106 automatically implements the interferent detection neural network upon receiving the digital image 3006 and prior to receiving user input for modifying one or more of the objects 3010a-3010d. However, in some implementations, the scene-based image editing system 106 waits until user input is received and then analyzes the digital image 3006 for interfering objects.
As shown in FIG. 30B, upon completion of the analysis, the scene-based image editing system 106 provides visual indicators 3014a-3014d for display within the graphical user interface 3002. In particular, the scene-based image editing system 106 provides the visual indicators 3014a-3014d to indicate that the objects 3010a-3010d have been classified as interfering objects.
In one or more embodiments, the scene-based image editing system 106 also provides the visual indicators 3014a-3014d to indicate that the objects 3010a-3010d have been selected for deletion. In some cases, the scene-based image editing system 106 also surfaces the pre-generated object masks for the objects 3010a-3010d in preparation for deleting the objects. Indeed, as already discussed, the scene-based image editing system 106 pre-generates object masks and content fills (e.g., using the above-referenced segmentation neural network 2604 and repair neural network 2610) for the objects of the digital image. Thus, the scene-based image editing system 106 has object masks and content fills readily available for modifying the objects 3010a-3010d.
In one or more embodiments, the scene-based image editing system 106 enables user interactions to add objects to, or remove objects from, the selection of objects to be deleted. For example, in some embodiments, upon detecting a user interaction with the object 3010a, the scene-based image editing system 106 determines to omit the object 3010a from the delete operation. In addition, the scene-based image editing system 106 removes the visual indication 3014a from the display of the graphical user interface 3002. On the other hand, in some implementations, the scene-based image editing system 106 detects a user interaction with the object 3008 and, in response, determines to include the object 3008 in the delete operation. Furthermore, in some cases, the scene-based image editing system 106 provides a visual indication for the object 3008 for display and/or surfaces the pre-generated object mask for the object 3008 in preparation for deletion.
As further shown in fig. 30B, the scene-based image editing system 106 provides a remove option 3016 for display within the graphical user interface 3002. In one or more embodiments, in response to detecting a user interaction with the remove option 3016, the scene-based image editing system 106 removes the objects that have been selected for deletion (e.g., the objects 3010a-3010d that have been classified as interfering objects). Indeed, as shown in FIG. 30C, the scene-based image editing system 106 removes the objects 3010a-3010d from the digital image 3006. Further, as shown in FIG. 30C, upon removal of the objects 3010a-3010d, the scene-based image editing system 106 exposes the previously generated content fills 3018a-3018d.
By enabling user interactions to control which objects are included in the delete operation and to further select when the selected objects are removed, the scene-based image editing system 106 provides greater flexibility. Indeed, while conventional systems typically delete interfering objects automatically without user input, the scene-based image editing system 106 allows interfering objects to be deleted according to user preferences expressed via user interactions. Thus, the scene-based image editing system 106 flexibly allows the removal process to be controlled via user interactions.
In various embodiments, in addition to removing interfering objects identified via the interferent detection neural network, the scene-based image editing system 106 provides other features for removing unwanted portions of a digital image. For example, in some cases, the scene-based image editing system 106 provides a tool whereby user interactions can select any portion of the digital image for deletion. FIGS. 31A-31C illustrate a graphical user interface implemented by the scene-based image editing system 106 to identify and remove interfering objects from digital images in accordance with one or more embodiments.
In particular, fig. 31A shows a digital image 3106 displayed on a graphical user interface 3102 of a client device 3104. The digital image 3106 corresponds to the digital image 3006 of fig. 30C after the interfering objects identified by the interferent detection neural network have been removed. Thus, in some cases, the objects remaining in the digital image 3106 represent those objects that were not identified as interfering objects and removed. For example, in some cases, the set of objects 3110 near the horizon of the digital image 3106 includes objects that were not identified as interfering objects by the interferent detection neural network.
As further shown in fig. 31A, scene-based image editing system 106 provides a brush tool option 3108 for display within graphical user interface 3102. Fig. 31B illustrates that upon detecting a user interaction with the brush tool option 3108, the scene-based image editing system 106 enables one or more user interactions to use the brush tool to select any portion of the digital image 3106 (e.g., a portion not identified by the interferent detection neural network) for removal. For example, as shown, the scene-based image editing system 106 receives one or more user interactions with the graphical user interface 3102 that target a portion of the digital image 3106 depicting the set of objects 3110.
As shown in fig. 31B, with the brush tool, scene-based image editing system 106 allows free-form user input in some cases. In particular, fig. 31B illustrates that scene-based image editing system 106 provides visual indication 3112 representing a portion (e.g., a target specific pixel) of digital image 3106 selected via a brush tool. In fact, rather than receiving user interactions with previously identified objects or other pre-segmented semantic regions, the scene-based image editing system 106 uses a brush tool to enable arbitrary selection of portions of the digital image 3106. Thus, scene-based image editing system 106 utilizes a brush tool to provide additional flexibility whereby user interaction can specify undesirable areas of a digital image that may not be identified by machine learning.
As further shown in fig. 31B, the scene-based image editing system 106 provides a remove option 3114 for display within the graphical user interface 3102. As shown in fig. 31C, in response to detecting a user interaction with the remove option 3114, the scene-based image editing system 106 removes the selected portion of the digital image 3106. Further, as shown, the scene-based image editing system 106 fills the selected portion with a content fill 3116. In one or more embodiments, where the portion removed from the digital image 3106 does not include an object for which a content fill was previously generated (or otherwise includes additional pixels not included in a previously generated content fill), the scene-based image editing system 106 generates the content fill 3116 after removing the portion of the digital image 3106 selected via the brush tool. Specifically, after removing the selected portion, the scene-based image editing system 106 generates the content fill 3116 using a content-aware hole-filling machine learning model.
In one or more embodiments, the scene-based image editing system 106 also implements intelligent dilation (smart dilation) when removing objects from a digital image. For example, in some cases, the scene-based image editing system 106 utilizes intelligent dilation to remove objects that touch, overlap, or otherwise come near other objects depicted in the digital image. FIG. 32A illustrates the scene-based image editing system 106 utilizing intelligent dilation to remove an object from a digital image in accordance with one or more embodiments.
In general, conventional systems utilize tight masks (e.g., masks that adhere tightly to the boundary of the corresponding object) to remove objects from digital images. In many cases, however, the digital image includes color bleeding or artifacts around the boundary of the object. For example, some image formats (e.g., JPEG) are particularly susceptible to format-related artifacts around object boundaries. When these problems occur, using a tight mask can produce adverse effects in the generated image. For example, repair models are often sensitive to these image imperfections and can produce significant artifacts when operating directly on the segmentation output. Thus, the resulting modified image inaccurately captures the user's intent to remove the object without creating additional image noise.
Thus, when an object is removed, the scene-based image editing system 106 expands (e.g., dilates) the object mask of the object to avoid the associated artifacts. However, a dilated object mask risks removing portions of other objects depicted in the digital image. For example, where a first object to be removed overlaps, touches, or is near a second object, the dilated object mask of the first object will typically extend into the space occupied by the second object. Thus, when the first object is removed using the dilated object mask, a larger portion of the second object is typically removed, and the resulting hole is (often incorrectly) filled, causing undesirable effects in the resulting image. Accordingly, the scene-based image editing system 106 utilizes intelligent dilation to avoid significantly expanding the object mask of the object to be removed into regions of the digital image occupied by other objects.
As shown in fig. 32A, the scene-based image editing system 106 determines to remove the object 3202 depicted in the digital image 3204. For example, in some cases, the scene-based image editing system 106 determines (e.g., via an interferent detection neural network) that the object 3202 is an interfering object. In some implementations, the scene-based image editing system 106 receives a user selection of an object 3202 to remove. Digital image 3204 also depicts objects 3206a-3206b. As shown, the object 3202 selected for removal overlaps with the object 3206b in the digital image 3204.
As further shown in fig. 32A, scene-based image editing system 106 generates object masks 3208 for objects 3202 to be removed and combined object masks 3210 for objects 3206a-3206b. For example, in some embodiments, scene-based image editing system 106 generates object mask 3208 and combined object mask 3210 from digital image 3204 using a segmented neural network. In one or more embodiments, scene-based image editing system 106 generates combined object masks 3210 by generating object masks for each of objects 3206a-3206b and determining a union between the individual object masks.
Further, as shown in fig. 32A, the scene-based image editing system 106 performs an act 3212 of expanding the object mask 3208 for the object 3202 to be removed. In particular, the scene-based image editing system 106 expands the representation of the object 3202 within the object mask 3208. In other words, the scene-based image editing system 106 adds pixels to the boundary of the representation of the object within the object mask 3208. The amount of expansion varies in various embodiments and is configurable in some implementations to accommodate user preferences. For example, in one or more implementations, the scene-based image editing system 106 expands the object mask outward by 10, 15, 20, 25, or 30 pixels.
After expanding the object mask 3208, the scene-based image editing system 106 performs an act 3214 of detecting overlap between the expanded object mask of the object 3202 and the object masks of the other detected objects 3206a-3206b (i.e., the combined object mask 3210). In particular, the scene-based image editing system 106 determines where pixels corresponding to the expanded representation of the object 3202 within the expanded object mask overlap pixels corresponding to the objects 3206a-3206b within the combined object mask 3210. In some cases, the scene-based image editing system 106 determines an intersection between the expanded object mask and the combined object mask 3210 and uses the resulting intersection to determine the overlap. The scene-based image editing system 106 also performs an act 3216 of removing the overlapping portion from the expanded object mask of the object 3202. In other words, the scene-based image editing system 106 removes, from the representation of the object 3202 within the expanded object mask, pixels that overlap pixels within the combined object mask 3210 corresponding to the object 3206a and/or the object 3206b.
Thus, as shown in fig. 32A, the scene-based image editing system 106 generates a smart augmented object mask 3218 (e.g., an extended object mask) for the object 3202 to be removed. In particular, scene-based image editing system 106 generates intelligently-expanded object masks 3218 by expanding object masks 3208 in areas that do not overlap any of objects 3206a-3206b and avoiding expansion in areas that overlap at least one of objects 3206a-3206 b. In at least some implementations, the scene-based image editing system 106 reduces expansion in overlapping areas. For example, in some cases, the intelligently-expanded object mask 3218 still includes an expansion in the overlapping region, but the expansion is significantly smaller when compared to the region where no overlap exists. In other words, scene-based image editing system 106 expands using fewer pixels in areas where there is overlap. For example, in one or more implementations, the scene-based image editing system 106 expands or dilates the object mask by a factor of 5, 10, 15, or 20 in areas where no overlap exists as compared to areas where overlap exists.
In other words, in one or more embodiments, the scene-based image editing system 106 generates the intelligently expanded object mask 3218 (e.g., the expanded object mask) by expanding the object mask 3208 of the object 3202 into areas not occupied by the object masks of the objects 3206a-3206b (e.g., areas not occupied by the objects 3206a-3206b themselves). For example, in some cases, the scene-based image editing system 106 expands the object mask 3208 into portions of the digital image 3204 that adjoin the object mask 3208. In some cases, the scene-based image editing system 106 expands the object mask 3208 into an adjoining portion by a set number of pixels. In some implementations, the scene-based image editing system 106 expands the object mask 3208 into different adjoining portions using different numbers of pixels (e.g., based on detecting overlapping regions between the object mask 3208 and other object masks).
To illustrate, in one or more embodiments, the scene-based image editing system 106 expands the object mask 3208 into the foreground and background of the digital image 3204. In particular, the scene-based image editing system 106 determines the foreground by combining the object masks of the objects that are not being deleted. The scene-based image editing system 106 expands the object mask 3208 into the adjoining foreground and background. In some implementations, the scene-based image editing system 106 expands the object mask 3208 by a first amount into the foreground and by a second amount, different from the first amount, into the background (e.g., the second amount is greater than the first amount). For example, in one or more implementations, the scene-based image editing system 106 expands the object mask by 20 pixels into the background region and by two pixels into the foreground region (i.e., into adjoining object masks, such as the combined object mask 3210).
In one or more embodiments, the scene-based image editing system 106 determines the first amount for expanding the object mask 3208 into the foreground by initially expanding the object mask 3208 by the second amount—the same amount used to expand the object mask 3208 into the background. In other words, the scene-based image editing system 106 initially expands the object mask 3208 as a whole by the same amount in the foreground and the background (e.g., using the same number of pixels). The scene-based image editing system 106 then determines an overlap region between the expanded object mask and the object masks corresponding to the other objects 3206a-3206b (e.g., the combined object mask 3210). In one or more embodiments, the overlap region lies in the foreground of the digital image 3204 adjacent to the object mask 3208. Accordingly, the scene-based image editing system 106 reduces the expansion of the object mask 3208 into the foreground so that the expansion corresponds to the first amount. Indeed, in some cases, the scene-based image editing system 106 removes the overlap region (e.g., removes the overlapping pixels) from the expanded object mask of the object 3202. In some cases, the scene-based image editing system 106 removes a portion of the overlap region, rather than the entire overlap region, resulting in a reduced overlap between the expanded object mask of the object 3202 and the object masks corresponding to the objects 3206a-3206b.
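By way of illustration only, the following Python sketch (using OpenCV and NumPy) captures the intelligent dilation behavior described above: a larger expansion into the background and a smaller expansion where the mask would intrude on objects that remain. The pixel amounts and the mask format are illustrative assumptions rather than the system's actual parameters.

```python
# Smart dilation sketch: dilate widely into the background, narrowly where the
# expansion would overlap the combined mask of objects that stay in the image.
import cv2
import numpy as np

def smart_dilate(object_mask: np.ndarray, kept_objects_mask: np.ndarray,
                 background_px: int = 20, foreground_px: int = 2) -> np.ndarray:
    """object_mask, kept_objects_mask: HxW uint8 masks (255 = object pixel)."""
    kernel = np.ones((3, 3), np.uint8)
    wide = cv2.dilate(object_mask, kernel, iterations=background_px)
    narrow = cv2.dilate(object_mask, kernel, iterations=foreground_px)
    # Where the wide expansion would intrude on kept objects, fall back to the
    # narrow expansion; elsewhere keep the wide expansion.
    overlap = cv2.bitwise_and(wide, kept_objects_mask)
    smart = wide.copy()
    smart[overlap > 0] = narrow[overlap > 0]
    return smart
```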
In one or more embodiments, because removing object 3202 includes removing foreground and background adjacent to intelligently-expanded object mask 3218 (e.g., expanded object mask) generated for object 3202, scene-based image editing system 106 repairs remaining holes after removal. Specifically, scene-based image editing system 106 repairs the hole with foreground pixels and background pixels. Indeed, in one or more embodiments, scene-based image editing system 106 utilizes a repair neural network to generate foreground pixels and background pixels for the generated hole and utilizes the generated pixels to repair the hole, thereby generating a modified digital image (e.g., a repair digital image) in which object 3202 has been removed and the corresponding portion of digital image 3204 has been filled.
For example, FIG. 32B illustrates the advantages provided by intelligently expanding an object mask before performing the repair. In particular, FIG. 32B illustrates that when the intelligently expanded object mask 3218 (e.g., the expanded object mask) is provided to a repair neural network (e.g., the cascade modulation repair neural network 420) as the area to be filled, the repair neural network generates a modified digital image 3220 in which the area corresponding to the intelligently expanded object mask 3218 is filled with pixels generated by the repair neural network. As shown, the modified digital image 3220 does not include artifacts in the repaired area corresponding to the intelligently expanded object mask 3218. In effect, the modified digital image 3220 provides a realistic-looking image.
In contrast, fig. 32B illustrates that when the object mask 3208 (e.g., the unexpanded object mask) is provided to the repair neural network (e.g., the cascade modulation repair neural network 420) as the area to be filled, the repair neural network generates a modified digital image 3222 in which the area corresponding to the object mask 3208 is filled with pixels generated by the repair neural network. As shown, the modified digital image 3222 includes artifacts in the repaired area corresponding to the object mask 3208. Specifically, artifacts appear in the generated water along the back of the girl.
By generating intelligently expanded object masks, the scene-based image editing system 106 provides improved image results when objects are removed. In effect, the scene-based image editing system 106 utilizes the expansion to remove artifacts, color bleeding, or other undesirable errors around the removed object, while avoiding removing significant portions of other objects that remain in the digital image. Thus, the scene-based image editing system 106 is able to fill the hole left by the removed object without propagating existing errors and without unnecessarily replacing portions of the other remaining objects.
As previously described, in one or more embodiments, scene-based image editing system 106 also utilizes a shadow detection neural network to detect shadows associated with interfering objects depicted within the digital image. Fig. 33-38 illustrate diagrams of a shadow detection neural network used by a scene-based image editing system 106 to detect shadows associated with an object in accordance with one or more embodiments.
In particular, FIG. 33 illustrates an overview of a shadow detection neural network 3300 in accordance with one or more embodiments. In practice, as shown in fig. 33, the shadow detection neural network 3300 analyzes the input image 3302 through a first stage 3304 and a second stage 3310. Specifically, first stage 3304 includes an instance segmentation component 3306 and an object perception component 3308. In addition, the second stage 3310 includes a shadow prediction component 3312. In one or more embodiments, the example segmentation component 3306 includes the segmented neural network 2604 of the neural network pipeline discussed above with reference to fig. 26.
As shown in FIG. 33, after analyzing the input image 3302, the shadow detection neural network 3300 identifies the objects 3314a-3314c and shadows 3316a-3316c depicted therein. In addition, shadow detection neural network 3300 associates objects 3314a-3314c with their respective shadows. For example, shadow detection neural network 3300 associates object 3314a with shadow 3316a, and similarly associates other objects with shadows. Thus, when its associated object is selected for deletion, movement, or some other modification, shadow detection neural network 3300 facilitates the inclusion of shadows.
FIG. 34 illustrates an overview of an example segmentation component 3400 of a shadow detection neural network in accordance with one or more embodiments. As shown in fig. 34, an instance segmentation component 3400 implements an instance segmentation model 3402. As shown in FIG. 34, the instance segmentation component 3400 utilizes the instance segmentation model 3402 to analyze the input image 3404 and identify objects 3406a-3406c depicted therein based upon the analysis. For example, in some cases, the scene-based image editing system 106 outputs object masks and/or bounding boxes for the objects 3406a-3406c.
FIG. 35 illustrates an overview of an object perception component 3500 of a shadow detection neural network in accordance with one or more embodiments. In particular, FIG. 35 shows input image instances 3502a-3502c corresponding to each object detected within the digital image via the previous instance segmentation component. In particular, each input image instance corresponds to a different detected object and corresponds to an object mask and/or bounding box generated for the digital image. For example, input image instance 3502a corresponds to object 3504a, input image instance 3502b corresponds to object 3504b, and input image instance 3502c corresponds to object 3504c. Thus, input image instances 3502a-3502c illustrate separate object detection provided by an instance segmentation component of a shadow detection neural network.
In some embodiments, for each detected object, scene-based image editing system 106 generates input for the second stage of the shadow detection neural network (i.e., the shadow prediction component). Fig. 35 illustrates an object perception component 3500 that generates input 3506 for an object 3504 a. In effect, as shown in FIG. 35, object perception component 3500 generates input 3506 using input image 3508, object masks 3510 (referred to as object perception channels) corresponding to objects 3504a, and combined object masks 3512 (referred to as object discrimination channels) corresponding to objects 3504b-3504 c. For example, in some implementations, object perception component 3500 combines (e.g., connects) input image 3508, object mask 3510, and combined object mask 3512. Object awareness component 3500 similarly generates a second level of input for other objects 3504b-3504c as well (e.g., utilizing their respective object masks and a combined object mask representing the other objects along with input image 3508).
In one or more embodiments, the scene-based image editing system 106 generates the combined object mask 3512 (e.g., via the object perception component 3500 or some other component of the shadow detection neural network) using a union of the individual object masks generated for the object 3504b and the object 3504c. In some cases, the object perception component 3500 does not utilize the object discrimination channel (e.g., the combined object mask 3512). Rather, the object perception component 3500 uses the input image 3508 and the object mask 3510 to generate the input 3506. However, in some embodiments, using the object discrimination channel provides better shadow prediction in the second stage of the shadow detection neural network.
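By way of illustration only, the following NumPy sketch assembles the second-stage input described above by concatenating the input image, the object-awareness channel, and the object-discrimination channel; the function name and the channel layout are assumptions made for the sketch.

```python
# Build the object-aware input for one object of interest: image channels plus
# the object's mask plus the union of the other objects' masks.
import numpy as np

def build_shadow_prediction_input(image: np.ndarray, object_masks: list, index: int):
    """image: HxWx3, object_masks: list of HxW binary masks, index: object of interest."""
    object_channel = object_masks[index].astype(np.float32)
    others = [m for i, m in enumerate(object_masks) if i != index]
    # The union of the remaining objects' masks forms the discrimination channel.
    discrimination_channel = (np.any(others, axis=0).astype(np.float32)
                              if others else np.zeros_like(object_channel))
    # Concatenate along the channel axis: 3 image channels + 2 mask channels.
    return np.concatenate(
        [image.astype(np.float32),
         object_channel[..., None],
         discrimination_channel[..., None]], axis=-1)
```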
FIG. 36 illustrates an overview of a shadow prediction component 3600 of a shadow detection neural network in accordance with one or more embodiments. As shown in fig. 36, the shadow detection neural network provides inputs to a shadow prediction component 3600 that are compiled by an object-aware component that includes an input image 3602, an object mask 3604 for objects of interest, and a combined object mask 3606 for other detected objects. Shadow prediction component 3600 utilizes shadow segmentation model 3608 to generate a first shadow prediction 3610 for an object of interest and a second shadow prediction 3612 for other detected objects. In one or more embodiments, first shadow prediction 3610 and/or second shadow prediction 3612 includes shadow masking for a corresponding shadow (e.g., where shadow masking includes object masking for a shadow). In other words, shadow prediction component 3600 utilizes shadow segmentation model 3608 to generate first shadow prediction 3610 by generating shadow masks for predicted shadows of an object of interest. Similarly, shadow prediction component 3600 utilizes shadow segmentation model 3608 to generate second shadow prediction 3612 by generating a combined shadow mask for shadows predicted for other detected objects.
Based on the output of the shadow segmentation model 3608, the shadow prediction component 3600 provides object-shadow pair prediction 3614 for an object of interest. In other words, shadow prediction component 3600 associates an object of interest with its shadow cast within input image 3602. In one or more embodiments, shadow prediction component 3600 similarly generates object-shadow pair predictions for all other objects depicted in input image 3602. Accordingly, shadow prediction component 3600 identifies shadows depicted in a digital image and associates each shadow with its corresponding object.
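By way of illustration only, the following hedged sketch loops over the detected objects, reuses the input-assembly helper from the earlier sketch, and pairs each object mask with the shadow mask returned by a hypothetical shadow segmentation model; it is a simplification of the per-object shadow pairing described above.

```python
# Associate each detected object with its predicted shadow mask.
def pair_objects_with_shadows(image, object_masks, shadow_segmentation_model):
    pairs = []
    for i, obj_mask in enumerate(object_masks):
        net_input = build_shadow_prediction_input(image, object_masks, i)
        shadow_mask = shadow_segmentation_model(net_input)   # HxW shadow mask
        pairs.append((obj_mask, shadow_mask))                # object-shadow pair
    return pairs
```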
In one or more embodiments, the shadow segmentation model 3608 used by the shadow prediction component 3600 includes a segmentation neural network. For example, in some cases, the shadow segmentation model 3608 includes the detection masking neural network discussed above with reference to fig. 3. As another example, the shadow segmentation model 3608 includes the DeepLabv3 semantic segmentation model described by Liang-Chieh Chen et al., "Rethinking Atrous Convolution for Semantic Image Segmentation," arXiv:1706.05587 (2017), or the DeepLab semantic segmentation model described by Liang-Chieh Chen et al., "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs," arXiv:1606.00915 (2016), both of which are incorporated herein by reference in their entirety.
Fig. 37 illustrates an overview of an architecture of a shadow detection neural network 3700 in accordance with one or more embodiments. In particular, FIG. 37 illustrates a shadow detection neural network 3700 that includes the example segmentation component 3400 discussed with reference to FIG. 34, the object perception component 3500 discussed with reference to FIG. 35, and the shadow prediction component 3600 discussed with reference to FIG. 36. Further, fig. 37 shows that the shadow detection neural network 3700 generates object masking, shadow masking, and prediction with respect to each object depicted in the input image 3702. Thus, shadow detection neural network 3700 outputs a final prediction 3704 that associates each object depicted in the digital image with its shadow. Thus, as shown in fig. 37, the shadow detection neural network 3700 provides an end-to-end neural network framework that receives the digital image and outputs an association between the object and the shadow described therein.
In some implementations, the shadow detection neural network 3700 determines that an object depicted within the digital image does not have an associated shadow. Indeed, in some cases, the shadow detection neural network 3700 determines that shadows associated with an object are not depicted within the digital image when the digital image is analyzed with its various components. In some cases, scene-based image editing system 106 provides feedback indicating the lack of shadows. For example, in some cases, upon determining that no shadow is depicted within the digital image (or that no shadow associated with a particular object is present), the scene-based image editing system 106 provides a message for display or other feedback indicating the lack of shadow. In some cases, scene-based image editing system 106 does not provide explicit feedback, but does not automatically select or provide suggestions for including shadows within the object selection, as discussed below with reference to fig. 39A-39C.
In some implementations, when an object mask for an object has already been generated, the scene-based image editing system 106 utilizes the second stage of the shadow detection neural network to determine the shadow associated with the object depicted in the digital image. In effect, FIG. 38 illustrates a diagram for determining shadows associated with an object depicted in a digital image using the second stage of the shadow detection neural network in accordance with one or more embodiments.
As shown in fig. 38, the scene-based image editing system 106 provides an input image 3804 to a second stage 3800 of the shadow detection neural network (i.e., the shadow prediction model 3802). In addition, the scene-based image editing system 106 provides object masking 3806 to the second stage 3800. The scene-based image editing system 106 utilizes the second stage 3800 of the shadow detection neural network to generate shadow masks 3808 for shadows of objects depicted in the input image 3804, resulting in an association between the objects and shadows cast by objects within the input image 3804 (e.g., as shown in the visualization 3810).
By providing direct access to the second level of the shadow detection neural network, the scene-based image editing system 106 provides flexibility in shadow detection. Indeed, in some cases, object masks have been created for objects depicted in digital images. For example, in some cases, the scene-based image editing system 106 implements a separate segmented neural network to generate object masks for digital images as part of a separate workflow. Thus, object masking of the object already exists and the scene-based image editing system 106 utilizes previous work to determine the shadow of the object. Thus, the scene-based image editing system 106 also provides efficiency in that it avoids duplication by directly accessing the shadow prediction model of the shadow detection neural network.
39A-39C illustrate a graphical user interface implemented by the scene-based image editing system 106 to identify and remove shadows of objects depicted in a digital image, in accordance with one or more embodiments. In effect, as shown in fig. 39A, the scene-based image editing system 106 provides a digital image 3906 depicting an object 3908 for display within a graphical user interface 3902 of the client device 3904. As further shown, object 3908 casts shadow 3910 within digital image 3906.
In one or more embodiments, upon receiving digital image 3906, scene-based image editing system 106 utilizes a shadow-detection neural network to analyze digital image 3906. In particular, scene-based image editing system 106 utilizes a shadow-detection neural network to identify object 3908, identify shadows 3910 cast by object 3908, and further associate shadows 3910 with object 3908. As previously described, in some implementations, scene-based image editing system 106 also utilizes a shadow detection neural network to generate object masks for object 3908 and shadow 3910.
As previously discussed with reference to fig. 26, in one or more embodiments, the scene-based image editing system 106 identifies shadows cast by objects within the digital image as part of a neural network pipeline for identifying interfering objects within the digital image. For example, in some cases, the scene-based image editing system 106 uses a segmentation neural network to identify objects of the digital image, uses an interferent detection neural network to classify one or more objects as interfering objects, uses a shadow detection neural network to identify shadows and associate shadows with their corresponding objects, and uses a repair neural network to generate content fills to replace the removed objects (and shadows thereof). In some cases, the scene-based image editing system 106 automatically implements neural network pipelining in response to receiving the digital image.
In effect, as shown in fig. 39B, the scene-based image editing system 106 provides a visual indication 3912 for display within the graphical user interface 3902, the visual indication 3912 indicating a selection of an object 3908 to remove. As further shown, scene-based image editing system 106 provides visual indication 3914 for display, visual indication 3914 indicating selection of shadows 3910 to remove. As suggested, in some cases, scene-based image editing system 106 automatically (e.g., upon determining that object 3908 is an interfering object) selects object 3908 and shadow 3910 for deletion. However, in some implementations, scene-based image editing system 106 selects object 3908 and/or shadow 3910 in response to receiving one or more user interactions.
For example, in some cases, scene-based image editing system 106 receives a user's selection of object 3908 and automatically adds shadow 3910 to the selection. In some implementations, the scene-based image editing system 106 receives a user selection of an object 3908 and provides suggestions for display in the graphical user interface 3902 that suggest shadows 3910 to add to the selection. In response to receiving the additional user interaction, scene-based image editing system 106 adds shadow 3910.
As further shown in fig. 39B, the scene-based image editing system 106 provides a remove option 3916 for display within the graphical user interface 3902. As shown in fig. 39C, upon receiving a selection of remove option 3916, scene-based image editing system 106 removes object 3908 and shadow 3910 from the digital image. As further shown, scene-based image editing system 106 replaces object 3908 with content fill 3918 and replaces shadow 3910 with content fill 3920. In other words, scene-based image editing system 106 exposes content fill 3918 and content fill 3920, respectively, when object 3908 and shadow 3910 are removed.
While fig. 39A-39C illustrate the implementation of shadow detection for a delete operation, it should be noted that, in various embodiments, the scene-based image editing system 106 implements shadow detection for other operations (e.g., a move operation). Further, while fig. 39A-39C are discussed with respect to removing interfering objects from a digital image, the scene-based image editing system 106 enables shadow detection in the context of other features described herein. For example, in some cases, the scene-based image editing system 106 enables shadow detection for object-aware modifications in which user interactions directly target objects. Thus, the scene-based image editing system 106 provides further advantages for object-aware modifications by segmenting objects and their shadows and generating the corresponding content fills prior to receiving user interactions, thereby allowing seamless interaction with digital images when modifying objects.
By identifying shadows cast by objects within the digital image, the scene-based image editing system 106 provides improved flexibility over conventional systems. In practice, the scene-based image editing system 106 flexibly identifies objects within the digital image as well as other aspects of those objects depicted in the digital image (e.g., their shadows). Thus, the scene-based image editing system 106 provides better image results when removing or moving objects, as it accommodates these other aspects. This further results in reduced user interaction with the graphical user interface because the scene-based image editing system 106 does not require user interaction for moving or removing shadows of objects (e.g., user interaction for identifying shadow pixels and/or binding shadow pixels to objects).
In some implementations, the scene-based image editing system 106 implements one or more additional features to facilitate modification of the digital image. In some embodiments, these features provide additional user interface-based efficiencies in that they reduce the amount of user interaction with the user interface that is typically required to perform certain actions in the context of image editing. In some cases, these features also facilitate deployment of scene-based image editing systems 106 on computing devices with limited screen space, as they effectively use available space to facilitate image modification without crowding the display with unnecessary visual elements.
In some cases, these additional features incorporate three-dimensional effects into the image editing process. For example, in one or more embodiments, the scene-based image editing system 106 implements perspective-aware object movement operations when editing digital images. In particular, the scene-based image editing system 106 moves objects within the digital image based on a three-dimensional perspective associated with the digital image. In some cases, the scene-based image editing system 106 further modifies the object (e.g., automatically) via perspective-based sizing based on the moving object. In other words, the scene-based image editing system 106 adjusts the size of the object as the object moves according to the three-dimensional perspective of the digital image. 40-45C illustrate a scene-based image editing system 106 implementing perspective-aware object movement operations in accordance with one or more embodiments.
As described above, in some implementations, fully convolutional models suffer from slow growth of the effective receptive field, especially in the early stages of the network. Thus, using strided convolutions within the encoder can generate invalid features within the hole region, making feature correction at the decoding stage more challenging. Fast Fourier Convolution (FFC) can help early layers obtain a receptive field that covers the entire image. However, conventional systems use FFC only at the bottleneck layer, which is computationally demanding. Furthermore, a shallow bottleneck layer cannot effectively capture global semantic features. Thus, in one or more implementations, the scene-based image editing system 106 replaces the convolutional blocks in the encoder with FFCs at the encoder layers. The FFCs enable the encoder to propagate features at early stages, addressing the problem of invalid features being created within the hole, which helps improve the results.
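By way of illustration only, the following hedged PyTorch sketch shows a simplified spectral (global) branch of a Fast Fourier Convolution: features are transformed with a real FFT, a 1x1 convolution operates on the stacked real and imaginary parts, and an inverse FFT returns to the spatial domain. This is a reduced illustration of the general FFC idea, not the encoder used by the scene-based image editing system 106.

```python
# Simplified spectral transform giving every output location an image-wide
# receptive field, since each frequency component mixes all spatial positions.
import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis.
        self.freq_conv = nn.Sequential(
            nn.Conv2d(channels * 2, channels * 2, kernel_size=1),
            nn.ReLU(),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        freq = torch.fft.rfft2(x, norm="ortho")                  # B x C x H x (W//2+1)
        stacked = torch.cat([freq.real, freq.imag], dim=1)       # B x 2C x H x (W//2+1)
        stacked = self.freq_conv(stacked)
        real, imag = stacked.chunk(2, dim=1)
        freq = torch.complex(real, imag)
        return torch.fft.irfft2(freq, s=(h, w), norm="ortho")    # back to B x C x H x W
```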
As further shown in fig. 5, the cascade modulation repair neural network 502 also includes a decoder 506. As shown, decoder 506 includes a plurality of cascaded modulation layers 520a-520n. The cascade modulation layers 520a-520n process the input features (e.g., the input global feature map and the input local feature map) to generate new features (e.g., a new global feature map and a new local feature map). Specifically, each of the cascaded modulation layers 520a-520n operates at a different scale/resolution. Thus, the first cascaded modulation layer 520a takes input features at a first resolution/scale and generates new features at a lower scale/higher resolution (e.g., through upsampling as part of one or more modulation operations). Similarly, the additional cascaded modulation layers operate at lower scales/higher resolutions until a restored digital image is generated at the output scales/resolutions (e.g., lowest scale/highest resolution).
Further, each of the cascaded modulation layers includes a plurality of modulation blocks. For example, with respect to fig. 5, the first cascaded modulation layer 520a includes global modulation blocks and spatial modulation blocks. Specifically, the cascade modulation repair neural network 502 performs global modulation with respect to the input features of the global modulation block. In addition, the cascade modulation repair neural network 502 performs spatial modulation with respect to the input features of the spatial modulation block. By performing global and spatial modulation within each cascaded modulation layer, scene-based image editing system 106 refines the global position to generate a more accurate repair digital image.
As shown, the cascaded modulation layers 520a-520n are cascaded in that the global modulation block is fed into the spatial modulation block. Specifically, the cascade modulation repair neural network 502 performs spatial modulation at a spatial modulation block based on the features generated at the global modulation block. To illustrate, in one or more embodiments, the cascade modulation repair neural network 502 utilizes global modulation blocks to generate intermediate features. The cascade modulation repair neural network 502 also utilizes a convolutional layer (e.g., a 2-layer convolutional affine parameter network) to convert the intermediate features into spatial tensors. The cascade modulation repair neural network 502 utilizes spatial tensors to modulate the input features analyzed by the spatial modulation block.
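By way of illustration only, the following hedged sketch shows how a two-layer convolutional affine parameter network could convert the global block's intermediate features into a spatial tensor that scales the features processed by the spatial block; the normalization step is a simplifying assumption standing in for the modulation details discussed below.

```python
# Cascading sketch: global intermediate features -> spatial tensor -> spatially
# varying modulation of the spatial block's input features.
import torch
import torch.nn as nn

class AffineParameterNetwork(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # A 2-layer convolutional network producing a spatially varying scale.
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, global_intermediate):
        return self.net(global_intermediate)   # spatial tensor

def spatial_modulation(local_features, spatial_tensor, eps: float = 1e-8):
    # Each spatial location gets its own scale, followed by a simple
    # normalization to keep the activations well behaved.
    modulated = local_features * spatial_tensor
    return modulated / (modulated.std(dim=1, keepdim=True) + eps)
```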
For example, fig. 6 provides additional details regarding the operation of the global and spatial modulation blocks in accordance with one or more embodiments. Specifically, fig. 6 shows a global modulation block 602 and a spatial modulation block 603. As shown in fig. 6, the global modulation block 602 includes a first global modulation operation 604 and a second global modulation operation 606. In addition, spatial modulation block 603 includes global modulation operation 608 and spatial modulation operation 610.
For example, a modulation block (or modulation operation) includes a computer-implemented process for modulating (e.g., scaling or shifting) an input signal according to one or more conditions. To illustrate, a modulation block amplifies certain features while counteracting/normalizing these amplifications to keep operation within the generative model stable. Thus, for example, in some cases, a modulation block (or modulation operation) includes a modulation layer, a convolution layer, and a normalization layer. The modulation layer scales each input feature map of the convolution, and the normalization removes the effect of the scaling from the statistics of the convolved output feature maps.
In practice, because the modulation layer modifies the feature statistics, a modulation block (or modulation operation) typically includes one or more methods for accounting for these statistical changes. For example, in some cases, the modulation block (or modulation operation) includes a computer-implemented process that normalizes the features using batch normalization or instance normalization. In some embodiments, modulation is achieved by scaling and shifting the normalized activations according to affine parameters predicted from the input conditions. Similarly, some modulation processes replace feature normalization with a demodulation process. Thus, in one or more embodiments, the modulation block (or modulation operation) includes a modulation layer, a convolution layer, and a demodulation layer. For example, in one or more embodiments, the modulation block (or modulation operation) includes a modulation approach described by Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila, "Analyzing and Improving the Image Quality of StyleGAN," Proc. CVPR (2020) (hereinafter StyleGAN2), which is incorporated herein by reference in its entirety. In some cases, the modulation block includes one or more modulation operations.
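By way of illustration only, the following hedged PyTorch sketch shows the modulation-convolution-demodulation pattern in the manner of the cited StyleGAN2 approach: a style vector scales the convolution weights per input channel, and demodulation rescales each output channel so the feature statistics stay roughly at unit variance. Tensor shapes are assumptions made for the sketch.

```python
# Modulation-convolution-demodulation (StyleGAN2-style weight demodulation).
import torch
import torch.nn.functional as F

def modulated_conv2d(x, weight, style, eps: float = 1e-8):
    """x: B x Cin x H x W, weight: Cout x Cin x k x k, style: B x Cin."""
    b, cin, h, w = x.shape
    cout = weight.shape[0]
    # Modulation: scale the weights by the per-sample style vector.
    w_mod = weight.unsqueeze(0) * style.view(b, 1, cin, 1, 1)
    # Demodulation: normalize each output channel of the scaled weights.
    demod = torch.rsqrt(w_mod.pow(2).sum(dim=(2, 3, 4)) + eps)   # B x Cout
    w_mod = w_mod * demod.view(b, cout, 1, 1, 1)
    # A grouped convolution applies a different weight set to each batch sample.
    x = x.reshape(1, b * cin, h, w)
    w_mod = w_mod.reshape(b * cout, cin, *weight.shape[2:])
    out = F.conv2d(x, w_mod, padding=weight.shape[-1] // 2, groups=b)
    return out.reshape(b, cout, h, w)
```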
Further, in one or more embodiments, the global modulation block (or global modulation operation) comprises a modulation block (or modulation operation) that modulates the input signal in a spatially invariant manner. For example, in some embodiments, the global modulation block (or global modulation operation) performs modulation according to global features of the digital image (e.g., without spatially varying across the coordinates of the feature map or image). Thus, for example, a global modulation block comprises a modulation block that modulates an input signal according to an image code (e.g., global feature code) generated by an encoder. In some implementations, the global modulation block includes a plurality of global modulation operations.
In one or more embodiments, the spatial modulation block (or spatial modulation operation) includes a modulation block (or modulation operation) that modulates the input signal in a spatially varying manner (e.g., according to a spatially varying signature). In particular, in some embodiments, the spatial modulation block (or spatial modulation operation) modulates the input signal in a spatially varying manner using a spatial tensor. Thus, in one or more embodiments, the global modulation block applies global modulation in which affine parameters are consistent in spatial coordinates, and the spatial modulation block applies spatially varying affine transforms that vary in spatial coordinates. In some embodiments, the spatial modulation block includes spatial modulation operations (e.g., global modulation operations and spatial modulation operations) combined with another modulation operation.
For example, in some embodiments, the spatial modulation operation includes spatially-adaptive modulation as described by Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu in "Semantic Image Synthesis with Spatially-Adaptive Normalization," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019) (hereinafter Taesung), which is incorporated herein by reference in its entirety. In some embodiments, the spatial modulation operation utilizes an architecture that differs from Taesung, including a modulation-convolution-demodulation pipeline.
Thus, with respect to FIG. 6, scene-based image editing system 106 utilizes global modulation block 602. As shown, the global modulation block 602 includes a first global modulation operation 604 and a second global modulation operation 606. Specifically, the first global modulation operation 604 processes the global feature map 612. For example, global feature map 612 includes feature vectors generated by cascade modulation repair neural networks that reflect global features (e.g., high-level features or features corresponding to an entire digital image). Thus, for example, global feature map 612 includes feature vectors reflecting global features generated from previous global modulation blocks of the concatenated decoder layer. In some cases, global feature map 612 also includes feature vectors corresponding to the encoded feature vectors generated by the encoder (e.g., at the first decoder layer, in various implementations, scene-based image editing system 106 utilizes encoded feature vectors, style codes, global feature codes, constants, noise vectors, or other feature vectors as inputs).
As shown, the first global modulation operation 604 includes a modulation layer 604a, an upsampling layer 604b, a convolution layer 604c, and a normalization layer 604d. In particular, scene-based image editing system 106 utilizes modulation layer 604a to perform global modulation of global feature map 612 based on global feature code 614 (e.g., global feature code 516). In particular, scene-based image editing system 106 applies a neural network layer (i.e., a fully connected layer) to global feature code 614 to generate global feature vector 616. The scene-based image editing system 106 then modulates the global feature map 612 with the global feature vector 616.
In addition, the scene-based image editing system 106 applies the upsampling layer 604b (e.g., to modify the resolution scale). In addition, scene-based image editing system 106 applies convolutional layer 604c. In addition, the scene-based image editing system 106 applies the normalization layer 604d to complete the first global modulation operation 604. As shown, the first global modulation operation 604 generates global intermediate features 618. In particular, in one or more embodiments, the scene-based image editing system 106 generates the global intermediate feature 618 by combining (e.g., concatenating) the output of the first global modulation operation 604 with the encoded feature vector 620 (e.g., from a convolutional layer of an encoder with a matching scale/resolution).
As shown, the scene-based image editing system 106 also utilizes a second global modulation operation 606. In particular, scene-based image editing system 106 applies second global modulation operation 606 to global intermediate feature 618 to generate new global feature map 622. In particular, scene-based image editing system 106 applies global modulation layer 606a to global intermediate features 618 (e.g., conditional on global feature vector 616). In addition, scene-based image editing system 106 applies convolution layer 606b and normalization layer 606c to generate new global feature map 622. As shown, in some embodiments, scene-based image editing system 106 applies spatial bias when generating new global feature map 622.
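For illustration only, the following sketch mirrors the flow just described, a modulation layer conditioned on a global feature vector produced by a fully connected layer, followed by upsampling, convolution, and normalization, with the output concatenated with an encoder feature of matching resolution; all layer choices, channel sizes, and names are assumptions rather than the claimed architecture.

```python
# Hypothetical sketch of a global modulation operation (modulation, upsampling,
# convolution, normalization) conditioned on a global feature code, followed by
# concatenation with an encoder feature of matching resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalModulationOp(nn.Module):
    def __init__(self, in_ch, out_ch, global_dim):
        super().__init__()
        self.to_scale = nn.Linear(global_dim, in_ch)            # global feature code -> global feature vector
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm = nn.InstanceNorm2d(out_ch, affine=False)

    def forward(self, feature_map, global_code):
        scale = self.to_scale(global_code).unsqueeze(-1).unsqueeze(-1)
        x = feature_map * (1.0 + scale)                         # modulation layer
        x = F.interpolate(x, scale_factor=2, mode="nearest")    # upsampling layer
        x = self.conv(x)                                        # convolution layer
        return self.norm(x)                                     # normalization layer

# Combining the operation's output with an encoder feature of matching
# resolution to form a global intermediate feature (cf. element 618).
op = GlobalModulationOp(in_ch=64, out_ch=64, global_dim=256)
global_feature_map = torch.randn(1, 64, 32, 32)
global_feature_code = torch.randn(1, 256)
encoded_skip_feature = torch.randn(1, 64, 64, 64)
global_intermediate = torch.cat([op(global_feature_map, global_feature_code),
                                 encoded_skip_feature], dim=1)
```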
Further, as shown in fig. 6, scene-based image editing system 106 utilizes spatial modulation block 603. Specifically, spatial modulation block 603 includes global modulation operation 608 and spatial modulation operation 610. Global modulation operation 608 processes local feature map 624. For example, the local feature map 624 includes feature vectors generated by the cascade modulation repair neural network that reflect local features (e.g., low-level, specific, or spatially varying features). Thus, for example, the local feature map 624 includes feature vectors reflecting local features generated from the previous spatial modulation block of the concatenated decoder layer. In some cases, local feature map 624 also includes feature vectors corresponding to the encoded feature vectors generated by the encoder (e.g., at the first decoder layer, in various implementations, scene-based image editing system 106 utilizes encoded feature vectors, style codes, noise vectors, or other feature vectors).
As shown, scene-based image editing system 106 generates local intermediate features 626 from local feature map 624 using global modulation operation 608. Specifically, scene-based image editing system 106 applies modulation layer 608a, upsampling layer 608b, convolution layer 608c, and normalization layer 608d. Furthermore, in some embodiments, scene-based image editing system 106 applies spatial bias and broadcast noise to the output of global modulation operation 608 to generate local intermediate features 626.
As shown in fig. 6, scene-based image editing system 106 utilizes spatial modulation operation 610 to generate a new local feature map 628. In effect, spatial modulation operation 610 modulates local intermediate features 626 based on global intermediate features 618. In particular, scene-based image editing system 106 generates spatial tensors 630 from global intermediate features 618. For example, scene-based image editing system 106 applies a convolved affine parameter network to generate spatial tensor 630. In particular, the scene-based image editing system 106 applies a convolved affine parameter network to generate the intermediate space tensor. Scene-based image editing system 106 combines the intermediate spatial tensors with global feature vector 616 to generate spatial tensors 630. The scene-based image editing system 106 modulates the local intermediate features 626 (with the spatial modulation layer 610 a) with the spatial tensor 630 and generates a modulated tensor.
As shown, scene-based image editing system 106 also applies convolutional layer 610b to the modulated tensor. Specifically, convolution layer 610b generates a convolution feature representation from the modulated tensor. In addition, scene-based image editing system 106 applies normalization layer 610c to the convolved feature representations to generate new local feature maps 628.
Although shown as normalization layer 610c, in one or more embodiments, scene-based image editing system 106 applies a demodulation layer. For example, the scene-based image editing system 106 applies a modulation-convolution-demodulation pipeline (e.g., a general normalization rather than instance normalization). In some cases, this approach avoids potential artifacts (e.g., water drop artifacts) caused by instance normalization. In effect, the demodulation/normalization layer includes a layer that scales each output feature map by a uniform demodulation/normalization value (e.g., normalizing by a uniform standard deviation rather than using instance normalization, which relies on data-dependent constants based on the content of the feature map).
As shown in fig. 6, in some embodiments, scene-based image editing system 106 also applies shift tensor 632 and broadcast noise to the output of spatial modulation operation 610. For example, spatial modulation operation 610 generates normalized/demodulated features. The scene based image editing system 106 also generates a shift tensor 632 by applying the affine parameter network to the global intermediate feature 618. The scene-based image editing system 106 combines the normalized/demodulated features, shift tensors 632, and/or broadcast noise to generate a new local feature map 628.
In one or more embodiments, upon generating the new global feature map 622 and the new local feature map 628, the scene-based image editing system 106 proceeds to the next cascaded modulation layer in the decoder. For example, the scene-based image editing system 106 uses the new global feature map 622 and the new local feature map 628 as input features to additional cascaded modulation layers at different scales/resolutions. The scene-based image editing system 106 also utilizes additional cascaded modulation layers to generate additional feature maps (e.g., utilizing additional global modulation blocks and additional spatial modulation blocks). In some cases, scene-based image editing system 106 iteratively processes feature maps using cascaded modulation layers until a final scale/resolution is reached to generate a restored digital image.
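For illustration only, the following sketch shows such an iterative pass over cascaded modulation layers at increasing scales; the placeholder block stands in for the global and spatial modulation blocks described above, and all names and sizes are assumptions.

```python
# Illustrative-only loop over cascaded modulation layers at increasing
# resolutions: the new global and local feature maps from one layer become the
# inputs to the next until the final scale is reached.
import torch
import torch.nn as nn

class CascadedModulationBlock(nn.Module):
    """Placeholder block that upsamples both feature maps; a real block would
    also apply the global and spatial modulation described above."""
    def __init__(self, channels):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.global_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.local_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, f_global, f_local, global_code):
        # global_code is unused in this placeholder but kept for the interface.
        return self.global_conv(self.up(f_global)), self.local_conv(self.up(f_local))

def decode(f_global, f_local, global_code, blocks):
    for block in blocks:                     # one block per scale/resolution
        f_global, f_local = block(f_global, f_local, global_code)
    return f_local                           # final-resolution features for the repaired image

blocks = nn.ModuleList([CascadedModulationBlock(32) for _ in range(4)])
features = decode(torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16),
                  torch.randn(1, 256), blocks)   # 16x16 -> 256x256
```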
Although fig. 6 shows a global modulation block 602 and a spatial modulation block 603, in some embodiments, scene-based image editing system 106 utilizes a global modulation block followed by another global modulation block. For example, scene-based image editing system 106 replaces spatial modulation block 603 with an additional global modulation block. In such an embodiment, scene-based image editing system 106 replaces the affine parameter network (and spatial tensor) and corresponding spatial modulation shown in fig. 6 with a skip connection. For example, scene-based image editing system 106 utilizes the global intermediate features to perform global modulation with respect to the local intermediate vectors. Thus, in some cases, scene-based image editing system 106 utilizes a first global modulation block and a second global modulation block.
As described above, the decoder can also be described in terms of variables and equations to illustrate the operation of the cascade modulation repair neural network. For example, as described above, the decoder stacks a series of cascaded modulation blocks to upsample the input feature map. Each cascaded modulation block takes a global code $g$ as input to modulate the features according to the global representation of the partial image. Furthermore, in some cases, scene-based image editing system 106 provides a mechanism to correct local errors after predicting the global structure.
In particular, in some embodiments, scene-based image editing system 106 utilizes cascaded modulation blocks to address the challenge of generating features that are coherent both globally and locally. At a high level, the scene-based image editing system 106 follows the following approach: i) decomposing the global and local features to separate the local detail from the global structure, and ii) applying a cascade of global and spatial modulation that predicts the local detail from the global structure. In one or more implementations, the scene-based image editing system 106 utilizes spatial modulation generated from the global code to make better predictions (e.g., and discards instance normalization to make the design compatible with StyleGAN2).
In particular, the cascaded modulation takes the global and local features $F_g^{(i)}$ and $F_l^{(i)}$ from the previous scale and the global code $g$ as input, and generates new global and local features $F_g^{(i+1)}$ and $F_l^{(i+1)}$ at the next scale/resolution. To generate the new global feature $F_g^{(i+1)}$ from $F_g^{(i)}$, the scene-based image editing system 106 uses a global code modulation stage comprising a modulation-convolution-demodulation process that generates the upsampled feature $X$.
Since the global vector $g$ has limited expressive power for representing 2-d visual detail, and since the features inside and outside the hole are inconsistent, global modulation alone may produce distorted features that are inconsistent with the context. To compensate, in some cases, scene-based image editing system 106 utilizes a spatial modulation that generates more accurate features. Specifically, the spatial modulation uses $X$ as the spatial code and $g$ as the global code to modulate the input local feature $F_l^{(i)}$ in a spatially adaptive manner.
Furthermore, the scene-based image editing system 106 utilizes a unique spatial modulation-demodulation mechanism to avoid potential "water drop" artifacts caused by instance normalization in conventional systems. As shown, the spatial modulation follows a modulation-convolution-demodulation pipeline.
Specifically, for the spatial modulation, the scene-based image editing system 106 generates a spatial tensor $A_0 = \mathrm{APN}(X)$ from the feature $X$ with a 2-layer convolutional affine parameter network (APN). Meanwhile, the scene-based image editing system 106 generates a global vector $\alpha = \mathrm{fc}(g)$ from the global code $g$ with a fully connected layer (fc) to capture the global context. The scene-based image editing system 106 generates the final spatial tensor $A = A_0 + \alpha$ as the broadcast sum of $A_0$ and $\alpha$, and uses it to scale the intermediate feature $Y$ of the block with the element-wise product $\odot$: $\bar{Y} = Y \odot A$.
Further, for the convolution, the modulated tensor $\bar{Y}$ is convolved with a $3 \times 3$ learnable kernel $K$, with the result: $\hat{Y} = \bar{Y} * K$.
for spatially aware demodulation, scene-based image editing system 106 applies demodulation steps to calculate normalized outputsSpecifically, the scene-based image editing system 106 assumes that the input feature Y is an independent random variable with unit variance, and that after modulation, the desired variance of the output does not change, i.e. +.>Thus, this gives a demodulation calculation:
wherein the method comprises the steps ofIs the demodulation coefficient. In some cases, the scene-based image editing system 106 implements the foregoing equations using standard tensor operations.
In one or more implementations, the scene-based image editing system 106 also adds spatial bias and broadcast noise. For example, the scene-based image editing system 106 adds a shift tensor $B = \mathrm{APN}(X)$, generated from the feature $X$ by another affine parameter network (APN), and broadcast noise $n$ to the normalized feature $\tilde{Y}$ to generate the new local feature $F_l^{(i+1)}$.
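For illustration only, the following sketch implements the equations above under the stated unit-variance assumption; the specific demodulation coefficient shown is one way to satisfy that assumption and may differ from the formulation used in practice, and all names and shapes are assumptions.

```python
# Sketch of the spatial modulation-convolution-demodulation pipeline above:
# A = A0 + alpha, Y_bar = Y (*) A, convolution with K, demodulation toward unit
# expected variance, then the shift tensor B and broadcast noise n.
import torch
import torch.nn.functional as F

def spatial_modulation(Y, A0, alpha, K, B, eps=1e-8):
    # Y:     (N, C_in, H, W)  intermediate local feature of the block
    # A0:    (N, C_in, H, W)  spatial tensor from the APN applied to feature X
    # alpha: (N, C_in)        global vector fc(g)
    # K:     (C_out, C_in, 3, 3) learnable kernel
    # B:     (N, C_out, H, W) shift tensor from another APN applied to X
    A = A0 + alpha.unsqueeze(-1).unsqueeze(-1)     # final spatial tensor (broadcast sum)
    Y_bar = Y * A                                  # modulation (element-wise product)
    Y_hat = F.conv2d(Y_bar, K, padding=1)          # convolution with the 3x3 kernel
    # If Y has unit variance, the variance of Y_hat at each location is
    # sum_k K_k^2 * A_k^2, so dividing by its square root restores unit variance.
    variance = F.conv2d(A * A, K * K, padding=1)
    Y_tilde = Y_hat * torch.rsqrt(variance + eps)  # spatially-aware demodulation
    noise = torch.randn(Y_tilde.shape[0], 1, *Y_tilde.shape[2:])  # broadcast noise n
    return Y_tilde + B + noise                     # new local feature for the next scale
```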
Thus, in one or more embodiments, to generate a content fill of replacement pixels for a digital image having a replacement region, the scene-based image editing system 106 generates an encoded feature map from the digital image using an encoder of a content-aware hole-filling machine learning model (e.g., a cascade modulation repair neural network). Scene-based image editing system 106 also generates the content fill for the replacement region using the decoder of the content-aware hole-filling machine learning model. In particular, in some embodiments, scene-based image editing system 106 generates the content fill for the replacement region of the digital image using local and global feature maps from one or more decoder layers of the content-aware hole-filling machine learning model.
As discussed above with reference to fig. 3-6, in one or more embodiments, the scene-based image editing system 106 utilizes a segmented neural network to generate object masks for objects depicted in the digital image, and a content-aware hole-filling machine learning model to generate content fills for those objects (e.g., for object masks generated for the objects). As further mentioned, in some embodiments, the scene-based image editing system 106 generates object mask(s) and content fill(s) in anticipation of one or more modifications to the digital image, before user input for such modifications is received. For example, in one or more implementations, upon opening, accessing, or displaying the digital image 706, the scene-based image editing system 106 automatically generates object mask(s) and content fill(s) (e.g., without user input to do so). Thus, in some implementations, the scene-based image editing system 106 facilitates object-aware modification of digital images. FIG. 7 shows a diagram for generating object masking and content filling to facilitate object-aware modification of a digital image in accordance with one or more embodiments.
In one or more embodiments, the object-aware modification includes an editing operation that targets an object identified in the digital image. In particular, in some embodiments, the object-aware modification includes an editing operation that targets previously segmented objects. For example, as discussed, in some implementations, prior to receiving user input for modifying an object, the scene-based image editing system 106 generates a mask for the object depicted in the digital image. Thus, when the user selects an object (e.g., the user selects at least some pixels depicting the object), the scene-based image editing system 106 determines to target modification to the entire object, rather than requiring the user to specifically specify each pixel to edit. Thus, in some cases, object-aware modifications include modifications that target an object by managing all pixels that render the object as part of a coherent unit, rather than a single element. For example, in some implementations, object-aware modifications include, but are not limited to, move operations or delete operations.
As shown in fig. 7, the scene-based image editing system 106 utilizes a segmented neural network 702 and a hole-filling machine learning model 704 for content perception to analyze/process digital images 706. The digital image 706 depicts a plurality of objects 708a-708d against a background. Thus, in one or more embodiments, the scene-based image editing system 106 utilizes the segmented neural network 702 to identify objects 708a-708d within the digital image.
In one or more embodiments, the scene-based image editing system 106 utilizes a segmented neural network 702 and a content-aware hole-filling machine learning model 704 to analyze the digital image 706 in anticipation of receiving user input for modification of the digital image 706. Indeed, in some cases, scene-based image editing system 106 analyzes digital image 706 prior to receiving user input for such modification. For example, in some embodiments, the scene-based image editing system 106 automatically analyzes the digital image 706 in response to receiving or otherwise accessing the digital image 706. In some implementations, the scene-based image editing system 106 analyzes the digital image in response to general user input to initiate preprocessing in anticipation of subsequent modifications.
As shown in FIG. 7, the scene-based image editing system 106 utilizes the segmented neural network 702 to generate object masks 710 for objects 708a-708d depicted in the digital image 706. In particular, in some embodiments, the scene-based image editing system 106 utilizes a segmented neural network 702 to generate a separate object mask for each depicted object.
As further shown in FIG. 7, scene-based image editing system 106 utilizes content-aware hole-filling machine learning model 704 to generate content fills 712 for objects 708a-708 d. Specifically, in some embodiments, scene-based image editing system 106 utilizes content-aware hole-filling machine learning model 704 to generate separate content fills for each depicted object. As shown, scene-based image editing system 106 generates content fill 712 using object mask 710. For example, in one or more embodiments, the scene-based image editing system 106 utilizes the object mask 710 generated via the segmented neural network 702 as an indicator of a replacement region to be replaced with the content fill 712 generated by the content-aware hole-fill machine learning model 704. In some cases, scene-based image editing system 106 filters objects from digital image 706 using object mask 710, which results in remaining holes in digital image 706 being filled with content fill 712.
As shown in fig. 7, scene-based image editing system 106 utilizes object mask 710 and content fill 712 to generate a complete background 714. In one or more embodiments, the complete background image includes a set of background pixels with objects replaced with content fills. In particular, the complete background comprises the background of the digital image with the objects depicted within the digital image replaced with their corresponding content fills. In one or more implementations, the complete background includes a content fill generated for each object in the image. Thus, when objects are in front of each other, the complete background may include various levels of completion, such that the background for a first object includes a portion of a second object, and the background for the second object includes the semantic regions or furthest elements in the image.
In effect, FIG. 7 shows a background 716 of the digital image 706 having holes 718a-718d where objects 708a-708d were depicted. For example, in some cases, scene-based image editing system 106 filters out objects 708a-708d using object mask 710, leaving holes 718a-718d. In addition, scene-based image editing system 106 fills holes 718a-718d with content fill 712, thereby creating a complete background 714.
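For illustration only, the following sketch shows one way to form such a complete background by replacing masked object pixels with the corresponding pre-generated content fills; the array layout and names are assumptions.

```python
# Illustrative sketch: removing objects from the digital image using their
# object masks and covering the resulting holes with the pre-generated content
# fills to form a complete background.
import numpy as np

def complete_background(image, object_masks, content_fills):
    """image: (H, W, 3) array; object_masks: list of (H, W) boolean masks;
    content_fills: list of (H, W, 3) arrays, one fill per detected object."""
    background = image.copy()
    for mask, fill in zip(object_masks, content_fills):
        # Filtering out the object leaves a hole, which the content fill covers.
        background[mask] = fill[mask]
    return background
```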
In other implementations, the scene-based image editing system 106 utilizes the object mask 710 as an indicator of the replacement area in the digital image 706. In particular, the scene-based image editing system 106 utilizes the object mask 710 as an indicator of potential replacement areas that may be generated from receiving user input to modify the digital image 706 by moving/removing one or more of the objects 708a-708d. Thus, scene-based image editing system 106 replaces pixels indicated by object mask 710 with content fill 712.
Although fig. 7 indicates that separate complete backgrounds are generated, it should be appreciated that in some implementations, the scene-based image editing system 106 creates the complete background 714 as part of the digital image 706. For example, in one or more embodiments, the scene-based image editing system 106 locates the content fill 712 behind its corresponding object in the digital image 706 (e.g., as a separate image layer). Furthermore, in one or more embodiments, scene-based image editing system 106 locates object mask 710 behind its corresponding object (e.g., as a separate layer). In some implementations, the scene-based image editing system 106 places the content fill 712 behind the object mask 710.
Further, in some implementations, the scene-based image editing system 106 generates a plurality of filled-in (e.g., semi-complete) backgrounds for the digital image. For example, in some cases where the digital image depicts multiple objects, the scene-based image editing system 106 generates a filled-in background for each object of the multiple objects. To illustrate, the scene-based image editing system 106 generates the filled-in background for a given object by generating a content fill for that object while the other objects of the digital image are considered part of the background. Thus, in some cases, the content fill includes a portion of other objects positioned behind the object within the digital image.
Thus, in one or more embodiments, as shown in FIG. 7, the scene-based image editing system 106 generates a combined image 718. In effect, the scene-based image editing system 106 generates a combined image having the digital image 706, the object mask 710, and the content fill 712 as separate layers. Although FIG. 7 shows object masks 710 over objects 708a-708d within combined image 718, it should be appreciated that in various implementations, scene-based image editing system 106 places object masks 710 and content fills 712 after objects 708a-708 d. Thus, the scene-based image editing system 106 presents the combined image 718 for display within the graphical user interface such that the object mask 710 and the content fill 712 are hidden from view until a user interaction is received that triggers the display of those components.
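For illustration only, the following sketch shows one possible layer ordering for such a combined image, with each content fill placed behind its object mask and the digital image on top so that the fills and masks remain hidden until an object is moved or deleted; the data structure and names are assumptions.

```python
# Illustrative layer ordering for a combined image: each content fill sits
# behind its object mask, and the digital image is drawn on top, so fills and
# masks stay hidden until an object is moved or deleted.
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    kind: str      # "image", "object_mask", or "content_fill"
    z_index: int   # lower values are drawn first (further back)

def combined_image_layers(object_ids):
    layers = []
    for i, object_id in enumerate(object_ids):
        layers.append(Layer(f"content_fill_{object_id}", "content_fill", z_index=2 * i))
        layers.append(Layer(f"object_mask_{object_id}", "object_mask", z_index=2 * i + 1))
    layers.append(Layer("digital_image", "image", z_index=2 * len(object_ids)))
    return sorted(layers, key=lambda layer: layer.z_index)
```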
Further, while fig. 7 shows the combined image 718 separate from the digital image 706, it should be appreciated that in some implementations, the combined image 718 represents a modification to the digital image 706. In other words, in some embodiments, to generate the combined image 718, the scene-based image editing system 106 modifies the digital image 706 by adding additional layers consisting of object masks 710 and content fills 712.
In one or more embodiments, the scene-based image editing system 106 utilizes the combined image 718 (e.g., the digital image 706, the object mask 710, and the content fill 712) to facilitate various object-aware modifications with respect to the digital image 706. In particular, the scene-based image editing system 106 utilizes the combined image 718 to implement an efficient graphical user interface that facilitates flexible object-aware modification. Fig. 8A-8D illustrate a graphical user interface implemented by the scene-based image editing system 106 to facilitate a move operation in accordance with one or more embodiments.
In effect, as shown in FIG. 8A, the scene-based image editing system 106 provides a graphical user interface 802 for display on a client device 804, such as a mobile device. In addition, scene-based image editing system 106 provides digital image 806 for display with a graphical user interface.
It should be noted that the graphical user interface 802 of fig. 8A is stylistically simple. In particular, the graphical user interface 802 does not include a number of menus, options, or other visual elements in addition to the digital image 806. While the graphical user interface 802 of fig. 8A does not display menus, options, or other visual elements other than the digital image 806, it should be appreciated that the graphical user interface 802 displays at least some menus, options, or other visual elements in various embodiments—at least when the digital image 806 is initially displayed.
As further shown in FIG. 8A, the digital image 806 depicts a plurality of objects 808A-808d. In one or more embodiments, the scene-based image editing system 106 pre-processes the digital image 806 before receiving user input for a move operation. In particular, in some embodiments, scene-based image editing system 106 utilizes a segmented neural network to detect and generate masking of multiple objects 808a-808d and/or utilizes a hole-filling machine learning model of content perception to generate content fills corresponding to objects 808a-808d. Further, in one or more implementations, scene-based image editing system 106 generates object masking, content filling, and combined images when loading, accessing, or displaying digital image 806, and no user input is required other than to open/display digital image 806.
As shown in fig. 8B, scene-based image editing system 106 detects user interactions with object 808d via graphical user interface 802. In particular, fig. 8B illustrates the scene-based image editing system 106 detecting a user interaction (e.g., a touch interaction) performed with a user's finger (a portion of hand 810), although in various embodiments the user interaction is performed with other tools (e.g., a stylus, or a pointer controlled by a mouse or touch pad). In one or more embodiments, scene-based image editing system 106 determines that object 808d has been selected for modification based on the user interaction.
In various embodiments, scene-based image editing system 106 detects user interactions for selecting object 808d via various operations. For example, in some cases, scene-based image editing system 106 detects the selection via a single tap (or click) on object 808d. In some implementations, the scene-based image editing system 106 detects the selection of the object 808d via a double tap (or double click) or a press-and-hold operation. Thus, in some cases, scene-based image editing system 106 confirms the user's selection of object 808d with a second click or hold operation.
In some cases, scene-based image editing system 106 utilizes various interactions to distinguish single-object selections or multi-object selections. For example, in some cases, scene-based image editing system 106 determines a single tap for selecting a single object and a double tap for selecting multiple objects. To illustrate, in some cases, upon receiving a first tap of an object, the scene-based image editing system 106 selects the object. Further, upon receiving the second tap of the object, the scene-based image editing system 106 selects one or more additional objects. For example, in some implementations, the scene-based image editing system 106 selects one or more additional objects having the same or similar classifications (e.g., selects other people depicted in the image when the first tap interacts with a person in the image). In one or more embodiments, if the second tap is received within a threshold period of time after the first tap is received, the scene-based image editing system 106 identifies the second tap as an interaction for selecting the plurality of objects.
In some embodiments, scene-based image editing system 106 identifies other user interactions for selecting a plurality of objects within a digital image. For example, in some implementations, the scene-based image editing system 106 receives a drag action across a display of a digital image and selects all objects captured within the range of the drag action. To illustrate, in some cases, the scene-based image editing system 106 draws a box that grows with drag motion and selects all objects that fall within the box. In some cases, the scene-based image editing system 106 draws a line along the path of the drag motion and selects all objects that are intercepted by the line.
In some implementations, the scene-based image editing system 106 also allows user interaction to select different portions of an object. To illustrate, in some cases, upon receiving a first tap of an object, the scene-based image editing system 106 selects the object. Further, upon receiving the second tap on the object, the scene-based image editing system 106 selects a particular portion of the object (e.g., a limb or torso of a human or a component of a vehicle). In some cases, scene-based image editing system 106 selects the portion of the object touched by the second tap. In some cases, scene-based image editing system 106 enters a "child object" mode upon receiving the second tap and utilizes additional user interactions to select a particular portion of the object.
Returning to FIG. 8B, as shown, based on detecting a user interaction for selecting object 808d, scene-based image editing system 106 provides visual indication 812 associated with object 808d. Indeed, in one or more embodiments, scene-based image editing system 106 detects a user interaction with a portion of object 808d (e.g., with a subset of pixels depicting the object) and determines that the user interaction targets object 808d as a whole (rather than the particular pixel with which the user interacted). For example, in some embodiments, scene-based image editing system 106 utilizes a pre-generated object mask corresponding to object 808d to determine whether the user interaction targets object 808d or some other portion of digital image 806. For example, in some cases, upon detecting a user interaction with an area within the object mask that corresponds to object 808d, scene-based image editing system 106 determines that the user interaction targets object 808d as a whole. Thus, scene-based image editing system 106 provides visual indication 812 associated with object 808d as a whole.
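For illustration only, the following sketch shows one way to map a tap location to a whole object using pre-generated object masks, so that the modification targets the entire object rather than individual pixels; the function signature and mask format are assumptions.

```python
# Hypothetical sketch of object-aware selection: a tap location is tested
# against the pre-generated object masks, and the whole object whose mask
# contains the tapped pixel is selected rather than only the tapped pixels.
def select_object_at(tap_xy, object_masks):
    """tap_xy: (x, y) pixel coordinates; object_masks: dict mapping an object id
    to an (H, W) boolean NumPy mask. Returns the tapped object's id, or None."""
    x, y = tap_xy
    for object_id, mask in object_masks.items():
        if mask[y, x]:           # the tap landed inside this object's mask
            return object_id     # target the object as a whole
    return None                  # the tap hit the background
```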
In some cases, scene-based image editing system 106 indicates via graphical user interface 802 that the selection of object 808d has been registered using visual indication 812. In some implementations, the scene-based image editing system 106 utilizes the visual indication 812 to represent a pre-generated object mask corresponding to the object 808 d. Indeed, in one or more embodiments, in response to detecting a user interaction with object 808d, scene-based image editing system 106 presents a corresponding object mask. For example, in some cases, scene-based image editing system 106 presents object masks to prepare modifications to object 808d and/or to indicate that the object masks have been generated and are available for use. In one or more embodiments, rather than using visual indication 812 to represent the presentation of the object mask, scene-based image editing system 106 displays the object mask itself via graphical user interface 802.
Further, because scene-based image editing system 106 generates object masks for object 808d prior to receiving user input selecting object 808d, scene-based image editing system 106 presents visual indication 812 without the latency or delay associated with conventional systems. In other words, scene-based image editing system 106 presents visual indication 812 without any delay associated with generating object masks.
As further shown, based on detecting a user interaction for selecting object 808d, scene-based image editing system 106 provides an options menu 814 for display via graphical user interface 802. The options menu 814 shown in fig. 8B provides a plurality of options, although in various embodiments the options menu includes various numbers of options. For example, in some implementations, the options menu 814 includes one or more select options, such as the option that is determined to be popular or most frequently used. For example, as shown in FIG. 8B, the options menu 814 includes an option 816 to delete object 808 d.
Thus, in one or more embodiments, scene-based image editing system 106 provides modification options for display via graphical user interface 802 based on the context of user interactions. In effect, as just discussed, the scene-based image editing system 106 provides an options menu that provides options for interacting with (e.g., modifying) the selected object. In so doing, the scene-based image editing system 106 minimizes the screen clutter typical of many conventional systems by withholding options or menus from display until it is determined that those options or menus will be useful in the current context of user interaction with the digital image. Thus, the graphical user interface 802 used by the scene-based image editing system 106 allows for more flexible implementation on computing devices having relatively limited screen space, such as smartphones or tablet devices.
As shown in fig. 8C, scene-based image editing system 106 detects additional user interaction (as shown via arrow 818) via graphical user interface 802 for moving object 808d over digital image 806. In particular, scene-based image editing system 106 detects additional user interactions for moving object 808d from a first location to a second location in digital image 806. For example, in some cases, the scene-based image editing system 106 detects a second user interaction via a drag motion (e.g., a user input selects object 808d and moves across the graphical user interface 802 while remaining on object 808 d). In some implementations, following the initial selection of object 808d, scene-based image editing system 106 detects the additional user interaction as a click or tap on the second location and determines to use the second location as a new location for object 808d. It should be noted that scene-based image editing system 106 moves object 808d as a whole in response to additional user interactions.
As shown in fig. 8C, upon moving object 808d from the first position to the second position, scene-based image editing system 106 exposes content fill 820 placed behind object 808d (e.g., behind the corresponding object mask). In effect, as previously described, the scene-based image editing system 106 places the pre-generated content fills behind the objects (or corresponding object masks) for which the content fills were generated. Thus, upon removal of object 808d from its initial position within digital image 806, scene-based image editing system 106 automatically displays the corresponding content fill. Thus, the scene-based image editing system 106 provides a seamless experience in which objects can be moved without exposing any holes in the digital image itself. In other words, the scene-based image editing system 106 provides the digital image 806 for display as if it were a real scene in which the entire background is known.
Further, because scene-based image editing system 106 generated content fill 820 for object 808d prior to receiving user input of moving object 808d, scene-based image editing system 106 exposes or renders content fill 820 without the latency or delay associated with conventional systems. In other words, as object 808d moves over digital image 806, scene-based image editing system 106 incrementally exposes content fill 820 without any delay associated with generating content.
As further shown in fig. 8D, the scene-based image editing system 106 deselects object 808d upon completion of the move operation. In some embodiments, the scene-based image editing system 106 maintains the selection of object 808d until another user interaction is received (e.g., a user interaction with another portion of digital image 806) indicating deselection of object 808d. As further indicated, upon deselection of object 808d, scene-based image editing system 106 further removes the previously presented options menu 814. Thus, the scene-based image editing system 106 dynamically presents options for interacting with objects for display via the graphical user interface 802 to maintain a reduced display style that does not overwhelm computing devices having limited screen space.
FIGS. 9A-9C illustrate a graphical user interface implemented by the scene-based image editing system 106 to facilitate deletion operations in accordance with one or more embodiments. In effect, as shown in FIG. 9A, the scene-based image editing system 106 provides a graphical user interface 902 for display on a client device 904 and provides a digital image 906 for display in the graphical user interface 902.
As further shown in fig. 9B, the scene-based image editing system 106 detects a user interaction with an object 908 depicted in the digital image 906 via the graphical user interface 902. In response to detecting the user interaction, the scene-based image editing system 106 presents the corresponding object mask, providing a visual indication 910 (or the object mask itself) to display in association with the object 908, and providing an options menu 912 for display. Specifically, as shown, the options menu 912 includes options 914 for deleting the object 908 that has been selected.
Further, as shown in fig. 9C, scene-based image editing system 106 removes object 908 from digital image 906. For example, in some cases, the scene-based image editing system 106 detects additional user interactions (e.g., interactions with the option 914 for deleting the object 908) via the graphical user interface 902 and in response removes the object 908 from the digital image 906. As further shown, upon removal of the object 908 from the digital image 906, the scene-based image editing system 106 automatically exposes content fills 916 that were previously placed behind the object 908 (e.g., after the corresponding object mask). Thus, in one or more embodiments, scene-based image editing system 106 provides content fill 916 for immediate display upon removal of object 908.
Although fig. 8B, 8C, and 9B illustrate the scene-based image editing system 106 providing a menu, in one or more implementations, the scene-based image editing system 106 allows object-based editing without requiring or utilizing a menu. For example, the scene-based image editing system 106 selects the objects 808d, 908 and presents visual indications 812, 910 in response to a first user interaction (e.g., tapping on the respective objects). The scene-based image editing system 106 then performs object-based editing of the digital image in response to a second user interaction without using a menu. For example, in response to a second user input dragging an object across an image, the scene-based image editing system 106 moves the object. Alternatively, in response to a second user input (e.g., a second tap), scene-based image editing system 106 deletes the object.
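For illustration only, the following sketch outlines such a menu-less interaction pattern, a first tap selecting the object under the tap, a drag moving the selected object, and a second tap on the same object deleting it; the editor interface and its methods are hypothetical.

```python
# Minimal, purely illustrative dispatcher for the menu-less interactions
# described above. The move/delete methods stand in for hypothetical layer
# operations on the combined image.
class ObjectAwareEditor:
    def __init__(self, object_masks):
        self.object_masks = object_masks    # object id -> (H, W) boolean NumPy mask
        self.selected = None

    def on_tap(self, x, y):
        tapped = next((oid for oid, m in self.object_masks.items() if m[y, x]), None)
        if self.selected is None:
            self.selected = tapped          # first tap: select the whole object
        elif tapped == self.selected:
            self.delete(self.selected)      # second tap on the same object: delete
            self.selected = None

    def on_drag(self, dx, dy):
        if self.selected is not None:
            self.move(self.selected, dx, dy)  # moving exposes the content fill behind

    def move(self, object_id, dx, dy):
        ...  # translate the object layer; the content fill shows through

    def delete(self, object_id):
        ...  # hide the object layer; the content fill shows through
```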
The scene-based image editing system 106 provides greater flexibility in editing digital images than conventional systems. In particular, scene-based image editing system 106 facilitates object-aware modifications that are capable of interacting with objects without targeting underlying pixels. In fact, based on the selection of some pixels that help to render an object, the scene-based image editing system 106 flexibly determines that the entire object has been selected. This is in contrast to conventional systems that require a user to select an option from a menu indicating an intent to select an object, provide a second user input indicating an object to select (e.g., a bounding box for the object or a drawing for another general boundary of the object), and another user input to generate an object mask. Instead, the scene-based image editing system 106 provides for selecting objects with a single user input (tapping on the object).
Furthermore, upon user interaction to effect the modification after the previous selection, scene-based image editing system 106 applies the modification to the entire object, rather than the particular set of pixels selected. Thus, the scene-based image editing system 106 manages objects within the digital image as objects of a real scene, which are interactive and can be handled as a coherent unit. Furthermore, as discussed, by flexibly and dynamically managing the amount of content displayed on a graphical user interface in addition to digital images, the scene-based image editing system 106 provides improved flexibility with respect to deployment on smaller devices.
Furthermore, the scene-based image editing system 106 provides improved efficiency over many conventional systems. In fact, as previously mentioned, conventional systems typically require execution of a workflow consisting of a sequence of user interactions to perform the modification. In the case where modifications are intended to target a particular object, many of these systems require several user interactions to merely indicate that the object is the subject of a subsequent modification (e.g., a user interaction to identify the object and separate the object from the rest of the image), and a user interaction to close a loop of the performed modification (e.g., filling a hole remaining after removal of the object). However, the scene-based image editing system 106 reduces the user interaction typically required for modification by preprocessing the digital image prior to receiving user input for such modification. In fact, by automatically generating object masks and content fills, the scene-based image editing system 106 eliminates the need for user interaction to perform these steps.
In one or more embodiments, the scene-based image editing system 106 performs further processing of the digital image in anticipation of modifying the digital image. For example, as previously described, in some implementations, the scene-based image editing system 106 generates a semantic scene graph from a digital image. Thus, in some cases, upon receiving one or more user interactions for modifying a digital image, scene-based image editing system 106 performs the modification using the semantic scene graph. Indeed, in many cases, scene-based image editing system 106 generates a semantic scene graph for modifying a digital image prior to receiving user input for modifying the digital image. FIGS. 10-15 illustrate diagrams for generating a semantic scene graph for a digital image in accordance with one or more embodiments.
In fact, many conventional systems are inflexible in that they typically wait for user interaction before determining the characteristics of the digital image. For example, such conventional systems typically wait for a user interaction indicating a characteristic to be determined, and then perform a corresponding analysis in response to receiving the user interaction. Thus, these systems do not have useful characteristics readily available. For example, upon receiving a user interaction for modifying a digital image, conventional systems typically must perform an analysis of the digital image to determine the characteristics to be changed after the user interaction has been received.
Furthermore, as previously mentioned, such operations result in inefficient operation, as image editing typically requires a workflow of user interactions, many of which are used to determine the characteristics used in the execution of the modification. Thus, conventional systems typically require a significant amount of user interaction to determine the characteristics required for editing.
Scene-based image editing system 106 provides advantages by generating a semantic scene graph for a digital image in anticipation of modifications to the digital image. In fact, by generating a semantic scene graph, the scene-based image editing system 106 increases flexibility relative to conventional systems because it facilitates the use of digital image characteristics in the image editing process. Furthermore, the scene-based image editing system 106 provides improved efficiency by reducing the user interaction required to determine these characteristics. In other words, the scene-based image editing system 106 eliminates user interaction often required in the preparatory steps of editing digital images in conventional systems. Thus, the scene-based image editing system 106 enables user interactions to focus more directly on the image editing itself.
Furthermore, by generating a semantic scene graph for the digital image, the scene-based image editing system 106 intelligently generates/obtains information that allows editing the image as a real world scene. For example, the scene-based image editing system 106 generates a scene graph indicating objects, object properties, object relationships, etc., which allows the scene-based image editing system 106 to implement object/scene-based image editing.
In one or more embodiments, the semantic scene graph includes a graphical representation of a digital image. In particular, in some embodiments, the semantic scene graph includes a graph that maps out characteristics of the digital image and its associated characteristic attributes. For example, in some implementations, the semantic scene graph includes a node graph having nodes representing characteristics of the digital image and values associated with nodes representing characteristic attributes of those characteristics. Furthermore, in some cases, edges between nodes represent relationships between characteristics.
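For illustration only, the following sketch shows one possible in-memory representation of such a semantic scene graph, with nodes for characteristics and their attribute values and edges for relationships; the field names and relationship labels are assumptions.

```python
# One possible in-memory representation of a semantic scene graph as described
# above: nodes for characteristics of the digital image with attribute values,
# and edges for relationships between characteristics.
from dataclasses import dataclass, field

@dataclass
class SceneGraphNode:
    node_id: str
    category: str                                     # e.g., "object", "scene", "light source"
    attributes: dict = field(default_factory=dict)    # characteristic attributes

@dataclass
class SceneGraphEdge:
    source: str
    target: str
    relationship: str                                 # e.g., "is member of", "holding"

@dataclass
class SemanticSceneGraph:
    nodes: dict = field(default_factory=dict)         # node_id -> SceneGraphNode
    edges: list = field(default_factory=list)

    def add_node(self, node: SceneGraphNode):
        self.nodes[node.node_id] = node

    def relate(self, source_id: str, target_id: str, relationship: str):
        self.edges.append(SceneGraphEdge(source_id, target_id, relationship))
```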
As described above, in one or more implementations, the scene-based image editing system 106 utilizes one or more predetermined or pre-generated template maps in generating a semantic scene graph for a digital image. For example, in some cases, scene-based image editing system 106 utilizes an image analysis map in generating a semantic scene map. FIG. 10 illustrates an image analysis diagram 1000 used by the scene-based image editing system 106 in generating a semantic scene graph in accordance with one or more embodiments.
In one or more embodiments, the image analysis graph includes a template graph for constructing a semantic scene graph. In particular, in some embodiments, the image analysis map includes a template map that provides a structural template that is used by the scene-based image editing system 106 to organize information included in the semantic scene map. For example, in some implementations, the image analysis graph includes a template graph of nodes that indicate how to organize a semantic scene graph that represents characteristics of the digital image. In some cases, the image analysis map additionally or alternatively indicates information to be represented in the semantic scene map. For example, in some cases, the image analysis graph indicates characteristics, relationships, and characteristic properties of digital images to be represented in the semantic scene graph.
In fact, as shown in FIG. 10, the image analysis graph 1000 includes a plurality of nodes 1004a-1004g. Specifically, the plurality of nodes 1004a-1004g correspond to characteristics of the digital image. For example, in some cases, the plurality of nodes 1004a-1004g represent a category of characteristics to be determined when analyzing a digital image. In fact, as shown, the image analysis map 1000 indicates that the semantic scene map will represent objects and groups of objects within the digital image, as well as the scene of the digital image, including light sources, settings, and specific locations.
As further shown in FIG. 10, the image analysis graph 1000 includes an organization of a plurality of nodes 1004a-1004g. In particular, the image analysis graph 1000 includes edges 1006a-1006h arranged in a manner that organizes a plurality of nodes 1004a-1004g. In other words, the image analysis chart 1000 shows the relationship between the characteristic categories included therein. For example, the image analysis graph 1000 indicates that the object class represented by node 1004f is closely related to the object group class represented by node 1004g, both depicting objects depicted in the digital image.
Further, as shown in FIG. 10, the image analysis graph 1000 associates property attributes with one or more of the nodes 1004a-1004g to represent property attributes of corresponding property categories. For example, as shown, the image analysis graph 1000 associates a seasonal attribute 1008a and a temporal attribute 1008b with a setting category represented by a node 1004 c. In other words, the image analysis chart 1000 indicates that seasons and times should be determined when determining the settings of the digital image. Further, as shown, the image analysis graph 1000 associates object masks 1010a and bounding boxes 1010b with object categories represented by node 1004 f. Indeed, in some implementations, the scene-based image editing system 106 generates content, such as object masks and bounding boxes, for objects depicted in the digital image. Thus, the image analysis map 1000 indicates that the pre-generated content is to be associated with nodes representing corresponding objects within the semantic scene graph generated for the digital image.
As further shown in FIG. 10, the image analysis graph 1000 associates property attributes with one or more of the edges 1006a-1006h to represent property attributes of corresponding property relationships represented by the edges 1006a-1006 h. For example, as shown, the image analysis graph 1000 associates a property attribute 1012a with an edge 1006g that indicates that an object depicted in the digital image is to be a member of a particular object group. Further, the image analysis graph 1000 associates a characteristic attribute 1012b with an edge 1006h indicating that at least some objects depicted in the digital image have a relationship with each other. FIG. 10 illustrates a sample of relationships identified between objects in various embodiments, and additional details regarding these relationships will be discussed in more detail below.
It should be noted that the property categories and property attributes represented in fig. 10 are exemplary, and that image analysis diagram 1000 includes various property categories and/or property attributes not shown in various embodiments. Furthermore, fig. 10 shows a specific organization of the image analysis map 1000, although alternative arrangements are used in different embodiments. Indeed, in various embodiments, scene-based image editing system 106 adapts to various property categories and property attributes to facilitate subsequent generation of semantic scene graphs that support various image edits. In other words, the scene-based image editing system 106 includes those property categories and property attributes that it determines to be useful in editing digital images.
Further, in one or more embodiments, scene-based image editing system 106 generates an image analysis map, such as image analysis map 1000 of FIG. 10, for generating a semantic scene map of a digital image. Indeed, in some embodiments, scene-based image editing system 106 generates (e.g., pre-generates, prior to analyzing the digital images to generate a semantic scene graph) an image analysis graph that provides a structural template of nodes and edges corresponding to features that may be represented in one or more digital images. For example, in one or more embodiments, the scene-based image editing system 106 generates an image analysis graph by generating template nodes corresponding to scene components (e.g., scenery, location, time of day, illumination sources, etc.) that may be depicted in a digital image. In addition, the scene-based image editing system 106 generates template nodes corresponding to objects that may be depicted in the digital image. In some cases, the object is a particular instance of a scene component. In some implementations, template nodes corresponding to potential objects are used as generic placeholders for objects depicted in digital images. In some embodiments, scene-based image editing system 106 then generates edges that connect the template nodes created for the image analysis graph.
In some embodiments, the scene-based image editing system 106 utilizes the real-world category description map to generate a semantic scene map for the digital image. FIG. 11 illustrates a real world category description map 1102 used by the scene-based image editing system 106 in generating a semantic scene map in accordance with one or more embodiments.
In one or more embodiments, the real world category description map includes a template map describing scene components (e.g., semantic regions) that may be depicted in a digital image. In particular, in some embodiments, the real world category description map includes a template map that is used by the scene-based image editing system 106 to provide context information to the semantic scene map regarding scene components (such as objects) that may be depicted in the digital image. Indeed, in some implementations, the real-world class descriptive map provides context information related to semantic regions (e.g., objects) of potential representation in the digital image. For example, in some implementations, the real world category description map provides a hierarchy of object classification and/or parsing (e.g., object components) of certain objects that may be depicted in the digital image. In some cases, the real world category description map also includes object attributes associated with the objects represented therein. For example, in some cases, the real world category description map provides object properties assigned to a given object, such as shape, color, material from which the object is made, weight of the object, weight that the object can support, and/or various other properties determined to be useful in subsequently modifying the digital image. Indeed, as will be discussed, in some cases, the scene-based image editing system 106 utilizes the semantic scene graph of the digital image to suggest certain edits or to suggest avoiding certain edits to maintain consistency of the digital image relative to the context information contained in the real-world category description graph that built the semantic scene graph.
As shown in FIG. 11, the real world category description graph 1102 includes a plurality of nodes 1104a-1104h and a plurality of edges 1106a-1106e connecting some of the nodes 1104a-1104 h. In particular, in contrast to the image analysis diagram 1000 of fig. 10, the real world category description diagram 1102 does not provide a single network of interconnected nodes. Conversely, in some implementations, the real-world category description map 1102 includes a plurality of clusters of nodes 1108a-1108c that are separate and distinct from each other.
In one or more embodiments, each node cluster corresponds to a separate scene component (e.g., semantic region) category that may be depicted in the digital image. In fact, as shown in FIG. 11, each of the node clusters 1108a-1108c corresponds to a separate object category that may be depicted in a digital image. As described above, in various embodiments, the real world category description map 1102 is not limited to representing object categories, and may represent other scene component categories.
As shown in FIG. 11, each of the node clusters 1108a-1108c depicts a hierarchy of category descriptions (alternatively referred to as a hierarchy of object classifications) corresponding to the represented object category. In other words, each of the node clusters 1108a-1108c depicts the degrees of specificity/generality with which an object can be described or labeled. Indeed, in some embodiments, the scene-based image editing system 106 applies all of the category descriptions/labels represented in a node cluster to describe the corresponding object depicted in the digital image. However, in some implementations, the scene-based image editing system 106 describes the object with a subset of the category descriptions/labels.
As an example, node cluster 1108a includes node 1104a representing a side table category and node 1104b representing a table category. Further, as shown in FIG. 11, node cluster 1108a includes an edge 1106a between node 1104a and node 1104b to indicate that the side table category is a sub-category of the table category, thereby indicating a hierarchy between the two categories applicable to side tables. In other words, node cluster 1108a indicates that a side table may be classified as a side table and/or, more generally, as a table. Accordingly, in one or more embodiments, upon detecting a side table depicted in a digital image, the scene-based image editing system 106 labels the side table as a side table and/or as a table based on the hierarchy represented in the real-world category description graph 1102.
The degree to which a node cluster represents a hierarchy of category descriptions varies in various embodiments. In other words, the length/height of the represented hierarchy differs across embodiments. For example, in some implementations, node cluster 1108a also includes a node representing a furniture category, indicating that a side table may be categorized as furniture. In some cases, node cluster 1108a also includes a node representing an inanimate object category, indicating that a side table may be so classified. Further, in some implementations, node cluster 1108a includes a node representing an entity category, indicating that a side table may be classified as an entity. Indeed, in some implementations, the hierarchy of category descriptions represented within the real-world category description graph 1102 includes a category description/label (e.g., the entity category) at such a high level of generality that it applies to essentially all objects represented within the real-world category description graph 1102.
As further shown in FIG. 11, node cluster 1108a includes a parsing of the represented object category (e.g., object components). In particular, node cluster 1108a includes a representation of a component of the table object category. For example, as shown, node cluster 1108a includes node 1104c representing a leg category. In addition, node cluster 1108a includes an edge 1106b that indicates that a leg from the leg category is part of a table from the table category. In other words, the edge 1106b indicates that a leg is a component of a table. In some cases, node cluster 1108a includes additional nodes for representing other components that are part of a table, such as a tabletop, a leaf, or an apron.
As shown in FIG. 11, the node 1104c representing the table leg object category is connected to the node 1104b representing the table object category, rather than to the node 1104a representing the side table object category. Indeed, in some implementations, the scene-based image editing system 106 utilizes this configuration based on determining that all tables include one or more legs. Thus, since side tables are a sub-category of tables, the configuration of node cluster 1108a indicates that all side tables also include one or more legs. However, in some implementations, the scene-based image editing system 106 additionally or alternatively connects the node 1104c representing the table leg object category to the node 1104a representing the side table object category to specify that all side tables include one or more legs.
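For illustration, the following sketch shows one possible representation of such a node cluster; the dictionary layout, edge labels, and the helper that walks the subclass hierarchy are hypothetical and are intended only to mirror the side table/table/leg example above:

```python
# Hypothetical sketch of one node cluster in a real-world category description
# graph; class names, edge labels, and attributes are illustrative only.
cluster_table = {
    "nodes": {
        "side table": {"attributes": {}},
        "table": {"attributes": {"supports_weight": True}},
        "leg": {"attributes": {}},
    },
    "edges": [
        ("side table", "is_subclass_of", "table"),  # hierarchy of category descriptions
        ("leg", "is_part_of", "table"),             # object parsing (components)
    ],
}

def labels_for(detected_class: str, cluster: dict) -> set:
    """Collect every category label that applies by walking subclass edges upward."""
    labels = {detected_class}
    frontier = [detected_class]
    while frontier:
        current = frontier.pop()
        for src, rel, dst in cluster["edges"]:
            if rel == "is_subclass_of" and src == current and dst not in labels:
                labels.add(dst)
                frontier.append(dst)
    return labels

# A detected side table can be labeled both "side table" and, more generally, "table".
print(labels_for("side table", cluster_table))  # {'side table', 'table'} (order may vary)
```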
Many conventional image editing systems are inflexible and inefficient when moving objects according to the perspective of a digital image. For example, while conforming objects within a digital image to an associated three-dimensional perspective is a fundamental, intuitive concept, conventional systems often fail to provide an intuitive tool for incorporating this concept when moving objects. Rather, these systems move objects rigidly, as if their digital images depicted a flat two-dimensional environment. These systems typically rely on the user to manipulate the moved object or its surrounding environment to give an overall three-dimensional appearance. As such, conventional systems often also require specialized knowledge of the tools and steps needed to properly move an object while adhering to perspective.
Many conventional image editing systems operate further inefficiently because they require a significant amount of user interaction to move objects while adhering to perspectives. For example, conventional systems typically require a first set of user interactions for moving objects and a second set of user interactions for individually resizing the objects based on the position of the objects resulting from the movement to conform to an associated perspective. Furthermore, in many cases, adjusting the size of the moving object to fit the perspective may be subject to trial and error until the size of the object appears correct, which results in redundancy in user interaction.
The scene-based image editing system 106 utilizes perspective-aware object movement operations to provide improved flexibility over conventional systems. In effect, the scene-based image editing system 106 flexibly incorporates three-dimensional perspective into the editing process by moving objects relative to the three-dimensional perspective of the digital image. Furthermore, by automatically resizing an object based on its movement, the scene-based image editing system 106 provides a further three-dimensional effect derived from the perspective. Thus, the scene-based image editing system 106 flexibly folds resizing modifications into movement modifications, enabling more intuitive perspective-based editing. Further, the scene-based image editing system 106 flexibly edits digital images according to the three-dimensional environment captured therein.
By implementing perspective-aware object movement operations, the scene-based image editing system 106 also provides improved efficiency over conventional systems. Indeed, where many conventional systems require a series of steps in multiple workflows to move and resize objects, respectively, the scene-based image editing system 106 operates based on a reduced set of user interactions. In particular, since perspective-aware object movement operations involve automatic resizing of objects, the scene-based image editing system 106 eliminates the separate user interaction that is typically required under conventional systems for resizing objects.
In one or more embodiments, a perspective-aware object movement operation includes an editing operation for moving and resizing an object within a digital image based on a perspective associated with the digital image. Indeed, in some cases, a perspective-aware object movement operation includes a type of object-aware modification, as it targets an object identified within the digital image. In some cases, when performing a perspective-aware object movement operation, the scene-based image editing system 106 determines movement of an object relative to a perspective associated with a digital image by determining movement of the object relative to a vanishing point and/or horizontal line associated with the digital image. In some cases, a perspective-aware object movement operation differs from a movement operation in that a movement operation allows an object to move freely within a digital image, while a perspective-aware object movement operation limits movement of the object to movement toward or away from the vanishing point associated with the digital image, while also resizing and/or occluding one or more objects as a result of the movement. For example, as will be shown below, in some implementations, perspective-aware object movement limits movement to a line extending from the vanishing point and through the object (e.g., the center or some other portion of the object).
As indicated, in some cases, perspective-aware object movement operations involve adjusting the size of an object via perspective-based sizing. In one or more embodiments, perspective-based sizing includes sizing an object based on movement of the object relative to a perspective of a digital image depicting the object. In particular, in some cases, perspective-based resizing of the object includes resizing the object based on movement of the object relative to a vanishing point and/or horizontal line associated with rendering the digital image of the object. For example, in some cases, perspective-based resizing of an object includes increasing the size of the object as the object moves away from the vanishing point of its digital image, or decreasing the size of the object as the object moves toward the vanishing point. In some cases, the scene-based image editing system 106 increases or decreases the size of the object in proportion to the distance the object moves away from or toward the vanishing point, respectively.
As described above, in one or more embodiments, scene-based image editing system 106 performs perspective-aware object movement operations on objects depicted in a digital image based on perspectives associated with the digital image. For example, in some cases, the scene-based image editing system 106 performs perspective-aware object movement operations based on vanishing points and/or horizontal lines associated with the digital image. Thus, in some implementations, the scene-based image editing system 106 determines vanishing points and/or horizontal lines of the digital image. FIG. 40 illustrates a scene-based image editing system 106 that determines vanishing points and horizontal lines of a digital image in accordance with one or more embodiments.
In fact, as shown in FIG. 40, the digital image 4002 depicts an object 4004 (a cyclist along a path or road). As further shown, scene-based image editing system 106 determines vanishing point 4006 and horizontal line 4008 of digital image 4002. FIG. 40 shows that scene-based image editing system 106 provides visual indicators for vanishing point 4006 and horizontal line 4008. For example, in some cases, the scene-based image editing system 106 provides the visual indicators for display within a graphical user interface of a client device displaying the digital image 4002. However, in some implementations, the scene-based image editing system 106 does not provide such visual indicators.
As further shown in fig. 40, the scene-based image editing system 106 utilizes the perspective detection neural network 4010 to determine vanishing points 4006 and/or horizontal lines 4008. In one or more embodiments, the perspective detection neural network includes a computer-implemented neural network for determining vanishing points and/or horizontal lines of a digital image. In particular, in some cases, the perspective detection neural network includes a computer-implemented neural network that analyzes a digital image (e.g., visual properties of the digital image) and identifies vanishing points and/or horizontal lines of the digital image based on the analysis. In some cases, the perspective detection neural network also generates visual indicators of vanishing points and/or horizontal lines for display.
In one or more embodiments, the perspective detection neural network 4010 comprises a convolutional neural network. In particular, in some embodiments, perspective detection neural network 4010 comprises a convolutional neural network having dense connections (e.g., via dense blocks) between layers. To illustrate, in one or more embodiments, the perspective detection neural network 4010 comprises a plurality of dense blocks, each dense block comprising an equal number of layers. In some cases, a convolution with multiple output channels is performed on the input image before entering the first dense block. In some cases, scene-based image editing system 106 zero-pads each side of the input by one or more pixels to keep the feature map size fixed. In some embodiments, the perspective detection neural network 4010 uses a convolution followed by average pooling as a transition layer between two adjacent dense blocks. In some implementations, the perspective detection neural network 4010 performs global average pooling at the end of the last dense block, followed by a softmax classifier. In some embodiments, the scene-based image editing system 106 replaces the final layer (e.g., the softmax classification layer) of the perspective detection neural network 4010 with one or more separate heads to output the vanishing point and/or horizontal line associated with the digital image. In some cases, the perspective detection neural network 4010 utilizes the separate heads to provide more detail (e.g., the angle of the horizontal line and/or the distance of the horizontal line from the center of the digital image). In some implementations, the scene-based image editing system 106 utilizes, as the perspective detection neural network 4010, the DenseNet described by G. Huang et al., Densely Connected Convolutional Networks, in IEEE Conference on Computer Vision and Pattern Recognition, 2016 (hereinafter Huang), incorporated herein by reference in its entirety. In some cases, scene-based image editing system 106 replaces the last layer of the DenseNet described by Huang with one or more separate heads as described above.
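As a concrete but hypothetical sketch of this arrangement, the following assumes PyTorch and torchvision are available and swaps the final classification layer of a densely connected backbone for two separate heads, one for the vanishing point and one for the horizontal line; the head dimensions and output parameterization are illustrative assumptions, not the patented configuration:

```python
# Minimal sketch: DenseNet backbone with separate regression heads replacing
# the original softmax classifier. Head sizes and outputs are assumptions.
import torch
import torch.nn as nn
from torchvision import models

class PerspectiveDetector(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.densenet121(weights=None)
        self.features = backbone.features            # dense blocks + transition layers
        feat_dim = backbone.classifier.in_features   # global feature size
        # Separate heads instead of the original softmax classification layer.
        self.vanishing_point_head = nn.Linear(feat_dim, 2)  # (x, y) of the vanishing point
        self.horizon_head = nn.Linear(feat_dim, 2)          # (angle, offset from image center)

    def forward(self, image: torch.Tensor):
        feats = self.features(image)
        feats = nn.functional.relu(feats)
        feats = nn.functional.adaptive_avg_pool2d(feats, 1).flatten(1)  # global average pooling
        return self.vanishing_point_head(feats), self.horizon_head(feats)

vanishing_point, horizon = PerspectiveDetector()(torch.randn(1, 3, 224, 224))
```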
Although fig. 40 illustrates a scene-based image editing system 106 that utilizes a perspective detection neural network 4010 to determine vanishing points 4006 and/or horizontal lines 4008, in various embodiments, the scene-based image editing system 106 utilizes other means. For example, in some cases, scene-based image editing system 106 identifies vanishing point 4006 and/or horizon 4008 based on user interactions that provide vanishing point 4006 and/or horizon 4008.
As shown in fig. 40, the scene-based image editing system 106 determines that a vanishing point 4006 and a horizontal line 4008 are present in the digital image 4002. In other words, the scene-based image editing system 106 determines that the vanishing point 4006 and the horizontal line 4008 appear within the boundaries of the digital image 4002. However, in some cases, the scene-based image editing system 106 determines that at least one of the vanishing point 4006 or the horizontal line 4008 is external to the digital image 4002.
In some cases, scene-based image editing system 106 can change vanishing point 4006 and/or horizontal line 4008. For example, in some cases, the scene-based image editing system 106 detects one or more user interactions for moving the vanishing point 4006 and/or the horizontal line 4008 (e.g., by detecting one or more user interactions with visual indicators of the vanishing point 4006 and/or the horizontal line 4008 via a graphical user interface of a client device displaying the digital image 4002). Thus, the scene-based image editing system 106 determines the modified location of the vanishing point 4006 and/or the horizontal line 4008 based on one or more user interactions. Further, in some cases, the scene-based image editing system 106 performs perspective-aware object movement operations on objects depicted within the digital image 4002 based on the modified locations.
Further, in some embodiments, the scene-based image editing system 106 determines the vanishing point 4006 and/or the horizontal line 4008 via preprocessing of the digital image 4002. In effect, as previously described, the scene-based image editing system 106 automatically pre-processes the digital image (e.g., prior to receiving user input regarding the digital image). In some cases, scene-based image editing system 106 includes vanishing points 4006 and/or horizontal lines 4008 in the semantic scene graph generated for digital image 4002. Thus, in some cases, when performing perspective-aware object movement operations, the scene-based image editing system 106 references the semantic scene graph. For example, in some cases, scene-based image editing system 106, upon detecting a user selection to perform a perspective-aware object movement operation, references the semantic scene graph to retrieve vanishing point 4006 and/or horizontal line 4008.
As described above, in some embodiments, the scene-based image editing system 106 utilizes perspective views of the digital image (e.g., vanishing points and/or horizontal lines of the digital image) to perform perspective-based resizing of objects moving relative to the perspective views within the digital image. For example, the scene-based image editing system 106 utilizes perspective to determine how to resize an object as it moves toward or away from a vanishing point associated with a digital image. FIG. 41 illustrates a perspective view of using a digital image to determine how to resize an object moving relative to the perspective view in accordance with one or more embodiments.
As shown in fig. 41, the scene-based image editing system 106 identifies a horizontal line 4102 and vanishing point 4104 of the digital image (digital image not shown). In addition, as shown, scene-based image editing system 106 identifies objects 4106 depicted within the digital image. In particular, the scene-based image editing system 106 identifies an object 4106 at a first location 4108 within the digital image (e.g., a location of the object 4106 before being moved).
As further shown in fig. 41, the scene-based image editing system 106 determines a first distance 4110 (labeled d 1) from the object 4106 at the first location 4108 to a horizontal line 4102 of the digital image. In one or more embodiments, the scene-based image editing system 106 determines the first distance 4110 as the distance between the horizontal line 4102 and the top of the object 4106, the bottom of the object 4106, or the center of the object 4106. In some cases, scene-based image editing system 106 utilizes another portion of object 4106 to determine first distance 4110, such as an average pixel coordinate pair of object 4106 or some other portion specified via user input. Further, in some embodiments, scene-based image editing system 106 measures first distance 4110 in pixels. However, in some implementations, the scene-based image editing system 106 utilizes different metrics, such as metrics provided via user input.
As shown in fig. 41, scene-based image editing system 106 also determines a second location 4112 of object 4106 within the digital image. For example, in some cases, scene-based image editing system 106 moves object 4106 from first location 4108 to second location 4112 (e.g., in response to one or more user interactions). As shown in fig. 41, movement of the object 4106 from the first position 4108 to the second position 4112 involves movement relative to the vanishing point 4104 because the object 4106 moves closer to the vanishing point 4104 (and closer to the horizontal line 4102 when the vanishing point 4104 is located on the horizontal line 4102). As further shown, scene-based image editing system 106 determines a second distance 4114 (labeled d 2) from object 4106 at second location 4112 to horizontal line 4102.
In addition, as shown in FIG. 41, the scene-based image editing system 106 determines a perspective scaling ratio 4116 based on the first distance 4110 and the second distance 4114. In one or more embodiments, a perspective scaling ratio includes a metric for adjusting the size of an object based on movement of the object relative to the perspective of the digital image. In particular, in some embodiments, a perspective scaling ratio includes a ratio that indicates how to adjust the size of an object based on the amount by which the object moves relative to the perspective of the digital image. For example, as suggested, in some cases, the perspective scaling ratio indicates the degree to which the size of the object is increased or decreased based on the amount by which the object has moved away from or toward the vanishing point (and horizontal line) of the digital image.
In effect, as shown in FIG. 41, the scene-based image editing system 106 determines the perspective scaling ratio 4116 as the ratio of the second distance 4114 to the first distance 4110. Because the second distance 4114 shown in FIG. 41 is less than the first distance 4110, the perspective scaling ratio 4116 indicates that the object 4106 will be reduced in size. In addition, the perspective scaling ratio 4116 provides a numerical indication of the extent to which the object 4106 is reduced in size. For example, the perspective scaling ratio 4116 indicates that the resulting size of the object 4106 at the second location 4112 is a determined ratio (or percentage or fraction) of the size of the object 4106 at the first location 4108 (the ratio being determined using the first distance 4110 and the second distance 4114). As such, in one or more embodiments, scene-based image editing system 106 reduces the size of object 4106 at the second location 4112 based on the perspective scaling ratio 4116.
For clarity, FIG. 41 shows an example in which object 4106 is moved toward the vanishing point 4104 (and horizontal line 4102) of the digital image, causing scene-based image editing system 106 to reduce the size of object 4106 at the second location 4112. In one or more embodiments, in the event that object 4106 is moved away from vanishing point 4104, scene-based image editing system 106 increases the size of object 4106 at its resulting location. In effect, scene-based image editing system 106 determines the perspective scaling ratio of object 4106 based on the distances between the horizontal line 4102 and the object 4106 at its first and final positions. In this case, the perspective scaling ratio would represent an increase in the size of the object, since the distance at the final position of the object would be greater than the distance at the first position. In this way, scene-based image editing system 106 increases the size of object 4106 according to the perspective scaling ratio.
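The perspective scaling ratio described above reduces to a simple computation. The following sketch (with illustrative variable names and pixel-based distances; not the patented implementation) resizes an object by the ratio of its distance to the horizontal line after the move to its distance before the move:

```python
# Minimal sketch of perspective-based resizing using the distance-to-horizon ratio.
def perspective_scale(horizon_y: float, y_before: float, y_after: float) -> float:
    d1 = abs(horizon_y - y_before)   # distance of the object to the horizontal line before the move
    d2 = abs(horizon_y - y_after)    # distance after the move
    return d2 / d1                   # < 1 when moving toward the vanishing point, > 1 when moving away

def resize(width: float, height: float, scale: float) -> tuple:
    return width * scale, height * scale

# Object moved closer to the horizontal line (toward the vanishing point): it shrinks.
scale = perspective_scale(horizon_y=200.0, y_before=600.0, y_after=400.0)   # 0.5
print(resize(120.0, 180.0, scale))                                          # (60.0, 90.0)
```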
FIGS. 42A-42C illustrate a graphical user interface implemented by the scene-based image editing system 106 to perform a perspective-aware object movement operation on an object in accordance with one or more embodiments. In effect, as shown in FIG. 42A, the scene-based image editing system 106 provides a digital image 4206 depicting an object 4208 at a first location for display within a graphical user interface 4202 of a client device 4204. As further shown, the scene-based image editing system 106 determines the vanishing point 4212 and the horizontal line 4214 of the digital image 4206 (although, as noted above, in some cases the scene-based image editing system 106 does not actually display the vanishing point 4212 or the horizontal line 4214).
As shown in FIG. 42A, scene-based image editing system 106 detects a user interaction for selecting object 4208. In response to detecting the selection of object 4208, scene-based image editing system 106 provides, for display, a selectable option 4210 for performing a perspective-aware object movement operation on object 4208. In some cases, scene-based image editing system 106 utilizes selectable option 4210 to visually distinguish the option for performing a perspective-aware object movement operation from the option for performing a standard movement operation. For example, in some cases, the scene-based image editing system 106 initiates a perspective-aware object movement operation in response to selection of the selectable option 4210, but initiates a standard movement operation in response to selection of the object 4208 itself. Thus, the scene-based image editing system 106 allows for both types of movement operations while providing a clear indication of which operation the user selected. However, in some cases, the scene-based image editing system 106 utilizes other interaction means to determine that a perspective-aware object movement operation is to be used (e.g., a double-click interaction with object 4208 or a voice input invoking a perspective-aware object movement operation).
In effect, as shown in FIG. 42B, scene-based image editing system 106 detects a user selection of selectable option 4210. Thus, scene-based image editing system 106 determines to perform perspective-aware object movement on object 4208 based on further user interactions.
As previously described, and as shown in FIG. 42B, scene-based image editing system 106 determines a line 4216 extending from vanishing point 4212 to (and through) object 4208. In one or more embodiments, scene-based image editing system 106 determines line 4216 when object 4208 is selected or when selectable option 4210 is selected. In some cases, scene-based image editing system 106 determines line 4216 when preprocessing digital image 4206. In some cases, the scene-based image editing system 106 does not provide the line 4216 for display but still uses the line 4216 as a reference for moving the object 4208 relative to the vanishing point 4212. In other words, in one or more embodiments, the scene-based image editing system 106 performs the perspective-aware object movement operation by moving the object 4208 along the line 4216 toward or away from the vanishing point 4212. As previously described, in some cases, the scene-based image editing system 106 limits the perspective-aware object movement operation to movement of the object 4208 along the line 4216.
In effect, as shown in fig. 42C, the scene-based image editing system 106 moves the object 4208 within the digital image 4206 relative to the vanishing point 4212. In particular, scene-based image editing system 106 moves object 4208 along line 4216 away from vanishing point 4212.
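A minimal sketch of constraining movement in this way, assuming two-dimensional pixel coordinates and NumPy, projects the requested drag position onto the line extending from the vanishing point through the object; the function and variable names are illustrative:

```python
# Illustrative sketch: snap a requested drag position onto the line
# vanishing_point -> object_anchor so movement stays on that line.
import numpy as np

def constrain_to_vanishing_line(vanishing_point, object_anchor, requested_position):
    vp = np.asarray(vanishing_point, dtype=float)
    direction = np.asarray(object_anchor, dtype=float) - vp
    direction /= np.linalg.norm(direction)
    # Project the requested position onto the line vp + t * direction.
    t = np.dot(np.asarray(requested_position, dtype=float) - vp, direction)
    return vp + t * direction

# Dragging roughly toward the vanishing point snaps the object onto the line.
print(constrain_to_vanishing_line((400, 150), (700, 600), (650, 500)))
```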
As shown in FIG. 42C, as object 4208 is moved, scene-based image editing system 106 exposes a complete portion of the background of digital image 4206. In particular, where the first position of object 4208 represents an initial position of object 4208 within the digital image, the complete portion of the exposed background includes the content fill generated for object 4208. Indeed, in one or more embodiments, as described above, scene-based image editing system 106 pre-processes digital image 4206 to generate a content fill for object 4208. Furthermore, in some cases, scene-based image editing system 106 positions the content fill behind object 4208 (e.g., behind the object mask). Thus, prior to moving the object, the background portion covered by object 4208 has already been filled with the pre-generated content, and moving object 4208 exposes the pre-generated content rather than a hole in the digital image 4206.
As further shown, the scene-based image editing system 106 increases the size of the object 4208 (via perspective-based sizing) based on movement of the object 4208 away from the vanishing point 4212. In one or more embodiments, the scene-based image editing system 106 resizes the object 4208 in determining a resulting position of the object 4208 after being moved. In other words, scene-based image editing system 106 determines that movement of object 4208 has ended, uses the resulting position to determine perspective scaling as described above, and uses the perspective scaling to modify the size of object 4208. However, in some cases, the scene-based image editing system 106 resizes the object 4208 in real-time as the object 4208 is moved. Indeed, in some cases, scene-based image editing system 106 continuously adjusts the size of object 4208 as object 4208 moves along line 4216, allowing a user to see how the size of object 4208 changes with movement.
The above discussion of FIGS. 40-42C provides details of the scene-based image editing system 106 adjusting the size of an object within a digital image in response to user input moving the object relative to the vanishing point of the digital image. However, in one or more embodiments, the scene-based image editing system 106 receives user input to resize the object and determines a new location for the object based on the resizing input. To illustrate, in one or more embodiments, the scene-based image editing system 106 assigns a perspective scale attribute to an object depicted in the digital image. Furthermore, the scene-based image editing system 106 initializes the perspective scale (e.g., with an initial value of 1.0). In some embodiments, the scene-based image editing system 106 also receives user input to increase or decrease the size of the object (e.g., through a pinch gesture detected via a graphical user interface displaying the digital image) and adjusts the size of the object accordingly. For example, in some implementations, the scene-based image editing system 106 determines a scale factor based on the user input and multiplies the perspective scale of the object by the scale factor to obtain a new value of the perspective scale. The scene-based image editing system 106 uses the new value of the perspective scale to determine the new size of the object. The scene-based image editing system 106 also uses the determined scale factor to determine a new perspective-based position for the object via linear interpolation based on the original position of the object relative to the vanishing point.
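The following sketch illustrates this inverse flow under stated assumptions (a perspective scale attribute initialized to 1.0, a scale factor derived from a pinch gesture, and one plausible reading of the linear interpolation toward the vanishing point); the class and attribute names are hypothetical:

```python
# Hypothetical sketch: a pinch resize updates the perspective scale, and the
# new position is interpolated between the vanishing point and the object's
# original position so that a smaller scale lands closer to the vanishing point.
import numpy as np

class PerspectiveObject:
    def __init__(self, position, size):
        self.original_position = np.asarray(position, dtype=float)
        self.original_size = np.asarray(size, dtype=float)
        self.perspective_scale = 1.0   # initialized to 1.0

    def apply_pinch(self, scale_factor: float, vanishing_point):
        self.perspective_scale *= scale_factor
        new_size = self.original_size * self.perspective_scale
        vp = np.asarray(vanishing_point, dtype=float)
        # Linear interpolation from the vanishing point toward the original position.
        new_position = vp + self.perspective_scale * (self.original_position - vp)
        return new_position, new_size

obj = PerspectiveObject(position=(700, 600), size=(120, 180))
print(obj.apply_pinch(scale_factor=0.5, vanishing_point=(400, 150)))
```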
In one or more embodiments, the scene-based image editing system 106 also generates one or more perspective-based dimensional previews for objects depicted within the digital image. In addition, the scene-based image editing system 106 provides perspective-based dimensional previews for display within the graphical user interface to support perspective-aware object movement operations on the object. FIG. 43 illustrates a graphical user interface used by the scene-based image editing system 106 to provide a plurality of perspective-based dimensional previews for objects depicted within a digital image in accordance with one or more embodiments.
In one or more embodiments, a perspective-based dimensional preview includes a visual indication of how the size of an object will be adjusted as the object moves relative to the perspective of the digital image depicting the object. In particular, in some embodiments, the perspective-based dimensional preview includes a visual indication of the result of the perspective-based resizing of the object as it moves within the digital image relative to the perspective of the digital image (e.g., relative to the vanishing point located on the horizontal line). For example, in some cases, the perspective-based dimensional preview includes a visual preview of the resulting size of the object at a location within the digital image that differs from the current location of the object, the other location lying on the line from the vanishing point to (and through) the object.
In effect, as shown in FIG. 43, the scene-based image editing system 106 provides a digital image 4306 for display within the graphical user interface 4302 of the client device 4304. In addition, as shown, the scene-based image editing system 106 detects selection of an object 4308 depicted within the digital image via one or more user interactions with the graphical user interface 4302 and selection of a selectable option 4310 for performing perspective-aware object movement operations on the object 4308.
As shown in fig. 43, in response to detecting selection of selectable option 4310, scene-based image editing system 106 provides perspective-based dimensional previews 4312a-4312c for display within graphical user interface 4302. In particular, the scene-based image editing system 106 provides perspective-based dimensional previews 4312a-4312c along a line 4314 from a vanishing point 4316 of the digital image 4306 through the object 4308.
In one or more embodiments, the scene-based image editing system 106 generates perspective-based dimensional previews 4312a-4312c when the object 4308 is selected or when the selectable option 4310 is selected. In some cases, when preprocessing the digital image 4306, the scene-based image editing system 106 generates perspective-based dimensional previews 4312a-4312c.
In some embodiments, the scene-based image editing system 106 generates perspective-based dimensional previews 4312a-4312c by selecting positions on the line 4314 and determining the resulting dimensions of the object 4308 as it moves to those positions. For example, in some cases, scene-based image editing system 106 selects locations along line 4314 at regular intervals. In some cases, the scene-based image editing system 106 generates at least one perspective-based dimensional preview that indicates a larger size than the size of the object 4308 at its current position if the object 4308 is moved away from the vanishing point 4316 and that indicates a smaller size than the size of the object 4308 at its current position if the object 4308 is moved toward the vanishing point 4316. However, in some cases, the scene-based image editing system 106 only generates a perspective-based preview of a larger size or a smaller size of the pointing object 4308. For example, as shown in FIG. 43, because the object 4308 is initially positioned away from the vanishing point 4316, the scene-based image editing system 106 generates perspective-based dimensional previews 4312a-4312c to indicate the smaller size of the object 4308 as the object 4308 is moved closer to the vanishing point 4316. In different embodiments, the number of perspective-based dimensional previews generated and provided for display is different.
In one or more embodiments, the scene-based image editing system 106 modifies the perspective-based dimensional preview displayed within the graphical user interface 4302 as the object 4308 is moved along the line 4314. For example, in some cases, as object 4308 is moved along line 4314, scene-based image editing system 106 also moves perspective-based dimensional previews 4312a-4312c along the line. Furthermore, the scene-based image editing system 106 adjusts the perspective-based size previews 4312a-4312c based on their new locations on the line (e.g., using corresponding perspective scaling). In some cases, the scene-based image editing system 106 adds a new perspective-based dimensional preview when more space on line 4314 becomes available, or deletes a perspective-based dimensional preview when space on line 4314 decreases. For example, in some embodiments, when moving object 4308 along line 4314 to vanishing point 4316, scene-based image editing system 106 determines that there is sufficient space on the portion of line 4314 to the left of object 4308 and generates and provides a perspective-based dimensional preview for display along the portion of line 4314 (indicating a larger size of object 4308 if moved to the position of the perspective-based dimensional preview). Thus, in some implementations, the scene-based image editing system 106 dynamically modifies a set of perspective-based dimensional previews shown based on the current positioning of the object 4308.
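As an illustrative sketch (not the patented implementation), such previews can be produced by sampling positions at regular intervals along the line toward the vanishing point and applying the perspective scaling ratio at each sampled position; the function names and sampling scheme are assumptions:

```python
# Illustrative sketch: generate size previews at regular intervals along the
# line from the object toward the vanishing point.
import numpy as np

def size_previews(vanishing_point, object_position, object_size, horizon_y, count=3):
    vp = np.asarray(vanishing_point, dtype=float)
    pos = np.asarray(object_position, dtype=float)
    d1 = abs(horizon_y - pos[1])                          # distance to the horizontal line now
    previews = []
    for i in range(1, count + 1):
        t = i / (count + 1)                               # regular intervals along the line
        preview_pos = pos + t * (vp - pos)
        scale = abs(horizon_y - preview_pos[1]) / d1      # perspective scaling ratio at that spot
        previews.append((preview_pos, np.asarray(object_size, dtype=float) * scale))
    return previews

for position, size in size_previews((400, 150), (700, 600), (120, 180), horizon_y=150):
    print(position, size)
```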
In one or more embodiments, when performing perspective-aware object movement operations on the digital image, the scene-based image editing system 106 further modifies the digital image to appropriately reflect the occlusion of objects depicted within the digital image. In particular, when an object is moved to overlap with another object, the scene-based image editing system 106 modifies the object or the other object such that the overlapping portion is obscured. In some embodiments, to facilitate proper occlusion of objects within the digital image, the scene-based image editing system 106 determines an object depth of the object. FIG. 44 illustrates a scene-based image editing system 106 that determines object depths of objects depicted within a digital image in accordance with one or more embodiments.
As shown in fig. 44, the scene-based image editing system 106 analyzes the digital image 4402 depicting the first object 4404a and the second object 4404 b. As shown, the first object 4404a and the second object 4404b are separated by a distance. In other words, the first object 4404a and the second object 4404b do not currently overlap.
As further shown, the scene-based image editing system 106 analyzes the digital image 4402 using the depth prediction neural network 4406. In one or more embodiments, the depth-predictive neural network includes a computer-implemented neural network that determines depth information for objects (e.g., object depths) or other components depicted within the digital image. In particular, in some embodiments, the depth prediction neural network includes a computer-implemented neural network that analyzes the digital image and determines depth information for the object or other delineated component based on the analysis. It should be noted that the terms "depth prediction neural network" and "depth estimation neural network" may be used interchangeably. Thus, in one or more embodiments, the depth prediction neural network 4406 comprises one of the depth estimation neural networks described above.
As further shown in fig. 44, the scene-based image editing system 106 determines the object depth 4408 based on an analysis of the digital image 4402 using the depth prediction neural network 4406. For example, in some cases, the scene-based image editing system 106 determines the object depth 4408 by determining a first object depth of the first object 4404a and a second object depth of the second object 4404 b.
In one or more embodiments, an object depth includes a metric or set of metrics corresponding to the depth of an object depicted within a digital image. In particular, in some embodiments, the object depth includes one or more values indicative of the depth of the object relative to the scene depicted by the digital image. In some cases, the scene-based image editing system 106 measures the depth of an object from the camera that captured the digital image (e.g., as the distance between the object and the camera) or from some other reference point, whether that reference point appears within the digital image or outside the digital image.
In various embodiments, the scene-based image editing system 106 uses various metrics to represent the object depth of an object. For example, in some embodiments, scene-based image editing system 106 uses a set of values, where each value in the set represents a depth of a pixel that contributes to the rendering of the object. In some cases, scene-based image editing system 106 utilizes a maximum depth value or a minimum depth value determined for pixels contributing to the object. In some implementations, the scene-based image editing system 106 uses a total depth value (e.g., a value obtained by adding the depth values of each pixel contributing to the object). Further, in one or more embodiments, scene-based image editing system 106 utilizes the average object depth as the object depth of the object. In one or more embodiments, the average object depth of the object includes an average depth value over a set of pixels that contribute to the object's depiction.
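For illustration, and assuming a dense per-pixel depth map from a depth prediction neural network together with a binary object mask, the following sketch computes the per-object depth metrics described above (average, minimum, maximum, and total); the function name and inputs are hypothetical:

```python
# Minimal sketch: per-object depth metrics from a depth map and an object mask.
import numpy as np

def object_depth_metrics(depth_map: np.ndarray, object_mask: np.ndarray) -> dict:
    pixel_depths = depth_map[object_mask.astype(bool)]   # depths of pixels contributing to the object
    return {
        "average": float(pixel_depths.mean()),   # average object depth
        "minimum": float(pixel_depths.min()),
        "maximum": float(pixel_depths.max()),
        "total": float(pixel_depths.sum()),
    }

depth_map = np.random.default_rng(0).uniform(1.0, 10.0, size=(4, 4))
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
print(object_depth_metrics(depth_map, mask))
```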
In one or more embodiments, the scene-based image editing system 106 determines the object depth 4408 as part of the preprocessing of the digital image 4402. In some cases, scene-based image editing system 106 includes object depth 4408 in the semantic scene graph generated for digital image 4402. Thus, in some implementations, the scene-based image editing system 106 references the semantic scene graph when performing perspective-aware object movement operations. For example, in some cases, scene-based image editing system 106 references a semantic scene graph to retrieve object depth 4408 for comparison upon detection of a user selection to perform perspective-aware object movement operations.
In some cases, as described below, the scene-based image editing system 106 determines an updated object depth as the first object 4404a or the second object 4404b moves within the digital image 4402. Thus, in some implementations, the scene-based image editing system 106 uses the updated object depths to modify the digital image 4402 to reflect the appropriate occlusion between the objects 4404a-4404 b.
FIGS. 45A-45C illustrate a graphical user interface used by the scene-based image editing system 106 to reflect occlusions between objects within a digital image in accordance with one or more embodiments. In effect, as shown in FIG. 45A, the scene-based image editing system 106 provides a digital image 4506 depicting a first object 4508a and a second object 4508b for display within a graphical user interface 4502 of the client device 4504. FIG. 45A shows that the first object 4508a is farther from the vanishing point 4510 associated with the digital image 4506 than the second object 4508b. In one or more embodiments, scene-based image editing system 106 determines that the object depth of second object 4508b is greater than the object depth of first object 4508a based on the positioning of second object 4508b and first object 4508a. As further shown, the scene-based image editing system 106 has detected selection of the first object 4508a through user interaction with the graphical user interface 4502.
As shown in fig. 45B, the scene-based image editing system 106 moves and resizes the first object 4508a through perspective-aware object movement operations in response to further user interaction with the graphical user interface 4502. Specifically, the scene-based image editing system 106 moves the first object 4508a toward the vanishing point 4510, and reduces the size of the first object 4508a based on the movement. As further shown, scene-based image editing system 106 moves first object 4508a such that first object 4508a is closer to vanishing point 4510 than second object 4508 b. In addition, the scene-based image editing system 106 moves the first object 4508a such that the first object 4508a overlaps the second object 4508 b. In particular, the scene-based image editing system 106 creates an overlap region in which a portion of the first object 4508a overlaps a portion of the second object 4508 b.
In one or more embodiments, scene-based image editing system 106 determines an updated object depth for first object 4508a based on the movement of first object 4508a. In particular, the scene-based image editing system 106 determines the object depth of the first object 4508a at the location resulting from the movement. In some cases, scene-based image editing system 106 tracks (or continuously updates) the object depth of first object 4508a as first object 4508a moves. In some embodiments, when the first object 4508a stops moving or is deselected, the scene-based image editing system 106 determines an updated object depth. In some cases, as described above, the scene-based image editing system 106 maintains the perspective scale attribute of the first object 4508a and updates the value of the perspective scale as the first object 4508a is resized and/or moved within the digital image 4506. Thus, in some embodiments, the scene-based image editing system 106 determines the updated object depth of the first object 4508a by multiplying the initial object depth (e.g., initial average object depth) of the first object 4508a by the updated value of the perspective scale.
In some implementations, based on the object depth of the first object 4508a at its location caused by the movement, the scene-based image editing system 106 determines that the object depth of the first object 4508a is now greater than the object depth of the second object 4508b. Indeed, in some cases, scene-based image editing system 106 compares the object depth of first object 4508a at its resulting location with the object depth of second object 4508b, and determines that the object depth of first object 4508a is greater.
As shown in FIG. 45B, although the object depth of the first object 4508a is greater, the scene-based image editing system 106 depicts the first object 4508a occluding the second object 4508b within the overlapping region. Indeed, in some cases, the scene-based image editing system 106 depicts a selected object as occluding an unselected object while they overlap in order to provide the user with a full view of the selected object (e.g., to facilitate desired positioning). However, in some implementations, once the selected object overlaps another object, the scene-based image editing system 106 depicts the appropriate occlusion. Indeed, in some cases, once the overlap is detected, the scene-based image editing system 106 depicts the correct occlusion, providing a preview of the occlusion that would result at the end of the object movement.
As shown in fig. 45C, scene-based image editing system 106 modifies digital image 4506 to obscure first object 4508a with second object 4508b based on first object 4508a having a greater object depth than second object 4508b (e.g., when first object 4508a is deselected). In particular, the scene-based image editing system 106 occludes the first object 4508a within the region of overlap. In other words, the scene-based image editing system 106 uses portions of the second object 4508b within the overlap region to mask portions of the first object 4508a within the overlap region.
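A minimal sketch of this depth-based occlusion, assuming binary visibility masks for the two objects and a single (e.g., average) object depth per object, masks out the deeper object within the overlap region; the function and variable names are illustrative:

```python
# Illustrative sketch: the object with the greater object depth is masked out
# wherever the two objects overlap.
import numpy as np

def occlude_overlap(mask_a: np.ndarray, depth_a: float,
                    mask_b: np.ndarray, depth_b: float):
    """Return visibility masks after occluding the deeper object in the overlap region."""
    overlap = mask_a & mask_b
    if depth_a > depth_b:
        return mask_a & ~overlap, mask_b   # object b occludes object a
    return mask_a, mask_b & ~overlap       # object a occludes object b

mask_a = np.zeros((4, 6), dtype=bool); mask_a[:, :4] = True
mask_b = np.zeros((4, 6), dtype=bool); mask_b[:, 3:] = True
visible_a, visible_b = occlude_overlap(mask_a, 5.2, mask_b, 3.1)  # a is deeper, so b occludes a
```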
It is noted that fig. 45A-45C also illustrate that scene-based image editing system 106 completes first object 4508a as first object 4508a moves within digital image 4506. In fact, as shown in fig. 45A, the first object 4508a is only partially displayed within the digital image 4506. For example, in some cases, when the digital image 4506 is captured, the top of the first object 4508a is within the frame (and captured as part of the digital image 4506), while the bottom of the first object 4508a is not within the frame (and not captured as part of the digital image 4506). Thus, in some cases, the scene-based image editing system 106 does not have data for the bottom of the first object 4508a.
However, as shown in fig. 45B-45C, the scene-based image editing system 106 provides a first object 4508a for displaying its entirety within the digital image 4506. Thus, in some implementations, the scene-based image editing system 106 modifies the first object 4508a by completing the first object 4508a so as to be displayed entirely within the digital image 4506. In other words, in some cases, the scene-based image editing system 106 generates data corresponding to the bottom of the first object 4508a.
In particular, in some embodiments, scene-based image editing system 106 modifies first object 4508a by generating the bottom of first object 4508a. For example, in some cases, scene-based image editing system 106 generates one or more fill modifications that complete first object 4508a by filling in the bottom. In one or more embodiments, the scene-based image editing system 106 completes the first object 4508a utilizing the infill semantic map, generative semantic machine learning model, and/or generative image machine learning model described in U.S. application Ser. No. 18/190,513, entitled "UTILIZING A GENERATIVE MACHINE LEARNING MODEL AND GRAPHICAL USER INTERFACE FOR CREATING MODIFIED DIGITAL IMAGES FROM AN INFILL SEMANTIC MAP," filed on March 27, 2023, which is incorporated herein by reference in its entirety.
In one or more embodiments, the scene-based image editing system 106 completes the first object 4508a as part of a workflow for preprocessing the digital image 4506. In fact, as previously described, the scene-based image editing system 106 pre-processes the digital image 4506 in some cases before receiving user input modifying the digital image 4506. In some embodiments, scene-based image editing system 106 pre-processes digital image 4506 by analyzing first object 4508a and generating one or more fill modifications that complete first object 4508a. However, in some cases, the scene-based image editing system 106 completes the first object 4508a upon determining that the first object 4508a is being moved so as to expose a previously unseen portion.
Although FIGS. 45A-45C illustrate the scene-based image editing system 106 completing the first object 4508a by generating a portion not originally depicted in the digital image 4506 because it was outside the frame when the digital image 4506 was captured, in some embodiments the scene-based image editing system 106 similarly completes an object that is partially occluded by another object within the digital image. To illustrate, in some cases, the bottom of the first object 4508a is obscured by another object depicted in the digital image 4506 (rather than being outside the frame). Accordingly, the scene-based image editing system 106 modifies the first object 4508a by generating the bottom, for example utilizing the infill semantic map, generative semantic machine learning model, and/or generative image machine learning model described above.
Thus, in one or more embodiments, scene-based image editing system 106 provides appropriate occlusion of objects depicted within a digital image based on the movement of the objects and their relative object depths. Furthermore, in some embodiments, scene-based image editing system 106 provides appropriate de-occlusion of objects within the digital image. In particular, when the digital image is modified such that an object is no longer occluded (or is less occluded) by another object, the scene-based image editing system 106 modifies the object to expose more of the object. In some cases, where a portion of a previously occluded object was not originally captured as part of the digital image (e.g., because it was outside the frame or occluded by another object), the scene-based image editing system 106 completes the object for display by generating the previously unseen portion.
In one or more embodiments, rather than using object depth to reflect occlusion between overlapping objects, scene-based image editing system 106 utilizes a disparity metric. In some cases, scene-based image editing system 106 treats disparity as inverse depth, where objects closer to the horizon have less disparity than objects farther from the horizon. Thus, in some embodiments, the scene-based image editing system 106 determines that an object with a greater disparity occludes an overlapping object with a smaller disparity. In some implementations, the scene-based image editing system 106 utilizes a reference point or reference line other than the horizon to determine disparity.
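For illustration, this disparity test reduces to comparing inverse depths; a minimal sketch with hypothetical names:

```python
# Illustrative sketch: disparity treated as inverse depth; the object with the
# greater disparity occludes the other in an overlap.
def occluder(depth_a: float, depth_b: float) -> str:
    disparity_a, disparity_b = 1.0 / depth_a, 1.0 / depth_b
    return "object_a" if disparity_a > disparity_b else "object_b"

print(occluder(depth_a=5.2, depth_b=3.1))   # "object_b": the closer object has greater disparity
```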
By performing perspective-aware object movement operations as described above, the scene-based image editing system 106 provides improved flexibility and efficiency over conventional systems. For example, the scene-based image editing system 106 flexibly performs image editing according to a three-dimensional environment captured within a digital image. In particular, the scene-based image editing system 106 flexibly incorporates a three-dimensional perspective (e.g., vanishing point) of an object into the editing process by automatically adjusting the size of the object based on the movement of the object relative to the perspective. Additionally, the scene-based image editing system 106 also incorporates three-dimensional perspective by automatically modifying the digital image based on the object depth of the overlapping objects to depict the correct occlusion between the overlapping objects. The scene-based image editing system 106 further reduces the user interaction typically required to perform these edits under conventional systems by automatically resizing and occluding objects based on their movement relative to the perspective of the digital image. In effect, the scene-based image editing system 106 folds multiple edits into a single perspective-aware object movement operation that is implemented in response to a combined set of user interactions.
In one or more embodiments, the scene-based image editing system 106 implements depth-aware object movement operations when editing digital images. In particular, as an object is moved within a digital image, the scene-based image editing system 106 modifies the digital image to reflect the appropriate occlusion between the object and another object that overlaps the moved object. In some implementations, the depth-aware object movement operation differs from the perspective-aware object movement operation in that, when performing the depth-aware object movement operation, the scene-based image editing system 106 does not move and resize the object according to the perspective of the digital image. In alternative implementations, depth-aware object movement is performed in conjunction with perspective-aware movement such that, once an object is moved and resized, scene-based image editing system 106 also ensures proper occlusion with any other objects in the scene of the digital image. By incorporating object occlusion based on object depth, the scene-based image editing system 106 incorporates three-dimensional effects into the editing process. FIGS. 46-48C illustrate the scene-based image editing system 106 implementing depth-aware object movement operations in accordance with one or more embodiments.
Many conventional image editing systems are inflexible and inefficient in reflecting the proper occlusion between overlapping objects in a digital image. For example, conventional systems often cannot flexibly depict occlusions between objects unless the objects overlapped when the digital image was captured (e.g., the occlusion was captured as part of the digital image). In fact, conventional systems often cannot depict an occlusion when one object moves to overlap another object without receiving user input that directs the occlusion. For example, some conventional image editing systems strictly require user input indicating the layering of an object within the digital image in order to properly occlude the object when it overlaps another object. As such, many conventional systems treat digital images strictly as flat surfaces, with the objects depicted therein existing at the same depth.
Because many conventional image editing systems rely on user input to provide proper occlusion between overlapping objects, such systems operate inefficiently. In fact, such conventional systems typically require user interaction to indicate which object is to be occluded and which object is to do the occluding (whether the user interaction indicates top and bottom layers or indicates which object is intended to appear in front of the other). Thus, these systems provide an inefficient graphical user interface that relies on user interaction to properly react to the overlap of objects within the digital image. Furthermore, with many conventional systems, users must perform intensive editing operations to provide a realistic appearance between overlapping objects.
The scene-based image editing system 106 utilizes depth-aware object movement operations to provide improved flexibility over conventional systems. For example, the scene-based image editing system 106 flexibly accommodates depth differences between objects within a scene captured in a digital image by automatically modifying the digital image to depict occlusion between overlapping objects when one object is moved to overlap another. In effect, the scene-based image editing system 106 flexibly edits digital images as real-world scenes in which objects exist in a three-dimensional environment.
Furthermore, by modifying the digital image to reflect the appropriate occlusion between the moving object and another object, the scene-based image editing system 106 reduces the user interaction typically required by conventional systems to incorporate such occlusion. For example, the scene-based image editing system 106 does not require user input to indicate which object is occluded and which object is to be occluded, and to cut or otherwise modify the object to provide the appropriate occlusion.
As described above, in one or more embodiments, the scene-based image editing system 106 utilizes object depths of objects depicted in digital images to implement depth-aware object movement operations. Thus, in some embodiments, the scene-based image editing system 106 analyzes the digital image to determine an object depth of an object depicted therein. FIG. 46 illustrates a scene-based image editing system 106 that determines object depths of objects depicted in digital images in accordance with one or more embodiments.
In one or more embodiments, a depth-aware object movement operation includes an editing operation for moving an object within the digital image and occluding at least one object based on the movement. Indeed, in some cases, a depth-aware object movement operation includes a type of object-aware modification in that it targets an object identified within the digital image. In some cases, depth-aware object movement operations differ from standard movement operations in that a depth-aware object movement operation results in automatic occlusion of one or more overlapping objects based on their respective object depths. Indeed, as will be discussed, in one or more embodiments, the scene-based image editing system 106 compares the object depths of objects that overlap as a result of object movement and occludes at least one object based on the comparison.
In fact, as shown in fig. 46, the scene-based image editing system 106 analyzes the digital image 4602 depicting the first object 4604a and the second object 4604 b. The scene-based image editing system 106 uses a depth prediction neural network 4606 (e.g., one of the depth estimation neural networks described above) to analyze the digital image 4602 and determine an object depth 4608 based on the analysis. Specifically, the scene-based image editing system 106 determines a first object depth of the first object 4604a and a second object depth of the second object 4604 b.
As described above, in some cases, when performing perspective-aware object movement operations, the scene-based image editing system 106 updates the object depth because moving an object relative to the perspective of the digital image changes the object's depth. In contrast, in one or more embodiments, the scene-based image editing system 106 maintains the initial object depth of an object when performing depth-aware object movement operations because such operations do not change the depth of the object when the object is moved. In other words, in some cases, the scene-based image editing system 106 utilizes a depth-aware object movement operation to move objects vertically or horizontally in the image plane of the digital image without changing the distance of the objects to the vanishing point of the digital image.
Thus, in some cases, the scene-based image editing system 106 determines the object depths 4608 of the first object 4604a and the second object 4604b when they are positioned separately, and uses the object depths when the first object 4604a or the second object 4604b is moved to overlap with the other. For example, as previously described, in some implementations, the scene-based image editing system 106 pre-processes the digital image 4602 to determine the object depth 4608.
As further shown, the scene-based image editing system 106 analyzes the digital image 4602 using a depth prediction neural network 4606. In one or more embodiments, the depth-predictive neural network includes a computer-implemented neural network that determines depth information for objects (e.g., object depths) or other components depicted within the digital image. In particular, in some embodiments, the depth prediction neural network includes a computer-implemented neural network that analyzes the digital image and determines depth information for the object or other delineated component based on the analysis. It should be noted that the terms "depth prediction neural network" and "depth estimation neural network" may be used interchangeably. Thus, in one or more embodiments, the depth prediction neural network 4606 comprises one of the depth estimation neural networks described above.
As further shown in fig. 46, the scene-based image editing system 106 determines the object depth 4608 based on an analysis of the digital image 4602 using the depth prediction neural network 4606. For example, in some cases, scene-based image editing system 106 determines object depth 4608 by determining a first object depth of first object 4604a and a second object depth of second object 4604 b.
In one or more embodiments, the object depth includes a metric or set of metrics corresponding to the depth of an object depicted within the digital image. In particular, in some embodiments, the object depth includes one or more values indicative of the depth of the object relative to the scene depicted by the digital image. In some cases, the scene-based image editing system 106 measures object depth relative to the camera that captured the digital image (e.g., the distance between the object and the camera) or relative to some other reference point, whether that reference point appears within the digital image or outside of the digital image.
In various embodiments, the scene-based image editing system 106 uses various metrics to represent the object depth of an object. For example, in some embodiments, scene-based image editing system 106 uses a set of values, where each value in the set represents a depth of a pixel that contributes to the rendering of the object. In some cases, scene-based image editing system 106 utilizes a maximum depth value or a minimum depth value determined for pixels contributing to the object. In some implementations, the scene-based image editing system 106 uses a total depth value (e.g., a value obtained by adding the depth values of each pixel contributing to the object). Further, in one or more embodiments, scene-based image editing system 106 utilizes the average object depth as the object depth of the object. In one or more embodiments, the average object depth of the object includes an average depth value over a set of pixels that contribute to the object's depiction.
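For illustration purposes only, the following Python sketch shows how the object-depth metrics described above (minimum, maximum, total, and average depth over the pixels contributing to an object) might be computed from a per-pixel depth map and a binary object mask. The sketch assumes NumPy arrays; the names are hypothetical and it is not the disclosed implementation.

import numpy as np

def object_depth_metrics(depth_map, object_mask):
    # depth_map:   H x W array of per-pixel depth values (e.g., output of a
    #              depth prediction neural network).
    # object_mask: H x W boolean array, True for pixels contributing to the object.
    object_depths = depth_map[object_mask]
    return {
        "min": float(object_depths.min()),       # minimum depth value over the object
        "max": float(object_depths.max()),       # maximum depth value over the object
        "total": float(object_depths.sum()),     # total (summed) depth value
        "average": float(object_depths.mean()),  # average object depth
    }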
In one or more embodiments, the scene-based image editing system 106 determines the object depth 4608 as part of the preprocessing of the digital image 4602. In some cases, the scene-based image editing system 106 includes the object depth 4608 in the semantic scene graph generated for the digital image 4602. Thus, in some implementations, the scene-based image editing system 106 references the semantic scene graph when performing depth-aware object movement operations. For example, in some cases, the scene-based image editing system 106 references the semantic scene graph to retrieve the object depth 4608 for comparison upon detecting a user selection to perform a depth-aware object movement operation.
In one or more implementations, the scene-based image editing system 106 utilizes a depth prediction neural network as described in U.S. application Ser. No. 17/186,436, entitled "GENERATING DEPTH IMAGES UTILIZING A MACHINE-LEARNING MODEL BUILT FROM MIXED DIGITAL IMAGE SOURCES AND MULTIPLE LOSS FUNCTION SETS," filed on February 26, 2021, which is incorporated herein by reference in its entirety. Alternatively, the scene-based image editing system 106 utilizes a depth prediction neural network as described in U.S. application Ser. No. 17/658,873, entitled "UTILIZING MACHINE LEARNING MODELS TO GENERATE REFINED DEPTH MAPS WITH SEGMENTATION MASK GUIDANCE," filed on April 12, 2022, which is incorporated herein by reference in its entirety. Then, when editing an object to perform realistic scene editing, the scene-based image editing system 106 accesses depth information for the object (e.g., an average depth of the object) from the semantic scene graph 1412. For example, as an object is moved within the image, the scene-based image editing system 106 accesses depth information for the objects in the digital image from the semantic scene graph 1412 to ensure that the object being moved is not placed in front of an object having a lesser depth.
FIGS. 47A-47C illustrate a graphical user interface implemented by the scene-based image editing system 106 to perform a depth-aware object movement operation in accordance with one or more embodiments. As shown in FIG. 47A, the scene-based image editing system 106 provides, for display within a graphical user interface 4702 of a client device 4704, a digital image 4706 depicting a first object 4708a and a second object 4708b. As shown, the first object 4708a and the second object 4708b are each positioned within the digital image 4706 such that they do not overlap. As further shown, the scene-based image editing system 106 determines that the second object 4708b is selected.
As shown in FIG. 47B, in response to one or more further user interactions with the graphical user interface 4702, the scene-based image editing system 106 moves the second object 4708b within the digital image 4706. In particular, the scene-based image editing system 106 moves the second object 4708b to create an overlapping region between the first object 4708a and the second object 4708b. As shown in FIG. 47B, when the second object 4708b is moved to overlap with the first object 4708a, the scene-based image editing system 106 occludes the second object 4708b with the first object 4708a. Specifically, the scene-based image editing system 106 occludes portions of the second object 4708b with the overlapping portions of the first object 4708a. Thus, the scene-based image editing system 106 provides a view of the occlusion between the objects even while the second object 4708b is still selected and in the process of being moved.
In one or more embodiments, as previously described, the scene-based image editing system 106 pre-processes the digital image 4706 prior to receiving a user interaction for modifying the digital image 4706. For example, as described above, the scene-based image editing system 106 pre-processes the digital image 4706 to determine the object depths of the first object 4708a and the second object 4708b. Thus, in some cases, the scene-based image editing system 106 determines, before either object is moved, that the first object 4708a will be the occluding object and the second object 4708b will be the occluded object should an overlap occur. Indeed, in some cases, when preprocessing the digital image 4706, the scene-based image editing system 106 compares the object depths of the objects to identify the occluding object and the occluded object. However, in some implementations, the scene-based image editing system 106 compares object depths to identify the occluding object and the occluded object when movement of one of the objects is detected or when object overlap is determined.
In some implementations, the scene-based image editing system 106 determines the object depths of the first object 4708a and the second object 4708b as part of a workflow for preprocessing digital images. Thus, the scene-based image editing system 106 utilizes the results of the preprocessing workflow in modifying the digital image.
To illustrate, in one or more embodiments, the scene-based image editing system 106 pre-processes the digital image 4706 to identify the first object 4708a and the second object 4708b within the digital image. Further, in some embodiments, the scene-based image editing system 106 pre-processes the digital image 4706 to generate object masks for the first object 4708a and the second object 4708b. In some cases, the scene-based image editing system 106 utilizes a segmentation neural network (e.g., the detect-mask neural network 300 discussed above with reference to FIG. 3) to detect and generate the object masks for the first object 4708a and the second object 4708b. Thus, upon detecting the user interaction indicated by FIG. 47A, the scene-based image editing system 106 determines that the user interaction targets the second object 4708b because the second object 4708b has already been identified. Further, upon detecting the one or more user interactions shown in FIG. 47B, the scene-based image editing system 106 moves the second object 4708b within the digital image 4706 because the second object 4708b has already been segmented. In some cases, the scene-based image editing system 106 uses the object masks of the first object 4708a and the second object 4708b to determine their object depths.
In some cases, the scene-based image editing system 106 generates content fills for the first object 4708a and the second object 4708b as part of the preprocessing workflow. For example, in one or more embodiments, the scene-based image editing system 106 uses a content-aware hole-filling machine learning model (e.g., the cascaded modulation inpainting neural network 420 discussed above with reference to FIG. 4) to generate the content fills for the first object 4708a and the second object 4708b. In some cases, the scene-based image editing system 106 generates a completed background for the digital image 4706 using the content fills. For example, in some implementations, the scene-based image editing system 106 positions the generated content fills behind the corresponding objects (or object masks) within the digital image. Thus, as shown in FIG. 47B, when the second object 4708b is moved from its initial position, the scene-based image editing system 106 exposes the corresponding content fill such that the background of the digital image 4706 already appears complete.
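As a rough, non-limiting illustration of the compositing described above, the following Python sketch layers an object cut-out over a pre-completed background so that moving the object immediately exposes the content fill behind it. The NumPy-based compositing and all names are hypothetical assumptions rather than the disclosed implementation, and the sketch assumes the object placement stays within the canvas.

import numpy as np

def composite_over_content_fill(background_fill, object_rgba, offset):
    # background_fill: H x W x 3 image whose object regions were already filled
    #                  by a content-aware hole-filling model during preprocessing.
    # object_rgba:     h x w x 4 object cut-out (RGB plus alpha from its object mask).
    # offset:          (row, col) position of the object's top-left corner.
    canvas = background_fill.astype(np.float32).copy()
    r, c = offset
    h, w = object_rgba.shape[:2]
    alpha = object_rgba[..., 3:4].astype(np.float32) / 255.0
    region = canvas[r:r + h, c:c + w]
    # Alpha-composite the object over the completed background; wherever the
    # object is not (or no longer) located, the content fill is already visible.
    canvas[r:r + h, c:c + w] = alpha * object_rgba[..., :3] + (1.0 - alpha) * region
    return canvas.astype(np.uint8)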
As shown in fig. 47C, the scene-based image editing system 106 determines that the second object 4708b has been deselected. Thus, the scene-based image editing system 106 determines that the movement of the second object 4708b has been completed. Thus, upon deselecting the second object 4708b, the scene-based image editing system 106 provides a final occlusion based on the overlap between the first object 4708a and the second object 4708 b.
As suggested previously, in one or more embodiments, the scene-based image editing system 106 adds the object depths to the semantic scene graph generated for the digital image 4706 (e.g., as part of a preprocessing workflow). Thus, in some cases, when performing the depth-aware object movement operation, the scene-based image editing system 106 references the semantic scene graph. For example, in some cases, the scene-based image editing system 106 references the semantic scene graph to retrieve and compare the object depths in response to the second object 4708b being moved to create an overlapping region with the first object 4708a.
In some embodiments, as previously described, the scene-based image editing system 106 also compares object depths to determine which object has a greater or lesser object depth before the second object 4708b is moved. For example, in some cases, the scene-based image editing system 106 compares the object depths as part of the preprocessing of the digital image 4706. In some cases, the scene-based image editing system 106 uses the preprocessing to determine a ranking of the object depths and includes this ranking in the semantic scene graph generated for the digital image 4706. Thus, in some implementations, the scene-based image editing system 106 references the semantic scene graph (e.g., to retrieve the ranking) when performing the depth-aware object movement operation.
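The following Python sketch is one hypothetical way to store precomputed object depths in a semantic-scene-graph-like structure and to retrieve a depth ranking at edit time, as described above; the data-structure layout is an assumption for illustration only.

from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    object_id: str
    mask_id: str          # reference to the precomputed object mask
    average_depth: float  # object depth stored during preprocessing

@dataclass
class SemanticSceneGraph:
    objects: dict = field(default_factory=dict)

    def add_object(self, node: ObjectNode):
        self.objects[node.object_id] = node

    def depth_ranking(self):
        # Rank object identifiers from nearest (smallest depth) to farthest
        # (largest depth) so the ranking can simply be retrieved when a
        # depth-aware object movement operation is performed.
        return sorted(self.objects, key=lambda oid: self.objects[oid].average_depth)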
FIG. 47C depicts a scenario in which the scene-based image editing system 106 determines that the object depth of the second object 4708b is greater than the object depth of the first object 4708a and accordingly occludes the second object 4708b with the first object 4708a within the overlapping region. Specifically, the scene-based image editing system 106 uses the portion of the first object 4708a within the overlapping region to mask the portion of the second object 4708b within the overlapping region. In one or more embodiments, in a scenario in which the scene-based image editing system 106 determines that the object depth of the second object 4708b is less than the object depth of the first object 4708a, the scene-based image editing system 106 occludes the first object 4708a with the second object 4708b within the overlapping region.
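For illustration, the following Python sketch resolves the occlusion within an overlapping region by comparing two scalar object depths and masking the deeper object, consistent with the behavior described above; the NumPy masks and function name are hypothetical assumptions rather than the disclosed implementation.

import numpy as np

def resolve_overlap_occlusion(mask_a, depth_a, mask_b, depth_b):
    # mask_a, mask_b:   H x W boolean object masks at their current positions.
    # depth_a, depth_b: scalar object depths (e.g., average object depths).
    overlap = mask_a & mask_b
    visible_a = mask_a.copy()
    visible_b = mask_b.copy()
    if depth_a <= depth_b:
        visible_b &= ~overlap  # object B has greater depth, so it is occluded in the overlap
    else:
        visible_a &= ~overlap  # object A has greater depth, so it is occluded in the overlap
    return visible_a, visible_b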
Thus, in some implementations, the scene-based image editing system 106 implements depth-aware object movement operations using various other image editing features provided by the scene-based image editing system 106. Indeed, in some cases, the scene-based image editing system 106 utilizes the prior object detection and segmentation performed as part of the preprocessing workflow. Further, the scene-based image editing system 106 utilizes the content fills generated as part of the preprocessing workflow. In this way, the scene-based image editing system 106 provides a seamless experience when moving an object to create an overlap with another object that causes an occlusion, where the background already appears complete and the object can be moved with the appropriate occlusion provided in real time without waiting for further processing.
FIGS. 48A-48C illustrate another graphical user interface implemented by the scene-based image editing system 106 to perform a depth-aware object movement operation in accordance with one or more embodiments. In effect, as shown in FIG. 48A, the scene-based image editing system 106 provides a digital image 4806 for display within a graphical user interface 4802 of a client device 4804, the digital image 4806 depicting a first object 4808a, a second object 4808b, a third object 4808c, and a fourth object 4810. Because four objects are depicted in the digital image 4806, the scene-based image editing system 106 determines four object depths. In addition, the scene-based image editing system 106 determines a hierarchy or ranking of the object depths. In particular, the scene-based image editing system 106 identifies the object depth having the maximum depth value, the object depth having the minimum depth value, and the object depths having depth values in between. Thus, the scene-based image editing system 106 provides occlusion differently based on which objects overlap due to movement of one of the objects. However, as shown in FIG. 48A, the digital image 4806 initially depicts the third object 4808c partially occluded by the fourth object 4810. Specifically, the rear foot of the third object 4808c is blocked from view by the fourth object 4810. Thus, in some cases, the scene-based image editing system 106 also provides de-occlusion based on movement of one or more objects. As shown in FIG. 48A, the scene-based image editing system 106 has detected a selection of the third object 4808c via one or more user interactions with the graphical user interface 4802.
As shown in fig. 48B, the scene-based image editing system 106 moves the third object 4808c to create an overlapping region with the second object 4808B. As further shown, the scene-based image editing system 106 uses the third object 4808c within the overlap region to obscure the second object 4808b. In particular, the scene-based image editing system 106 determines that the object depth of the second object 4808b is greater than the object depth of the third object 4808c and accordingly obscures the second object 4808b within the overlapping region. In some cases, scene-based image editing system 106 determines which object has a greater object depth and which object has a lesser object depth when second object 4808b and third object 4808c are detected to overlap. In some cases, as described above, the scene-based image editing system 106 determines which object has a greater object depth and which object has a lesser object depth before the objects overlap and in some cases even before the third object 4808c is selected for movement (e.g., as part of preprocessing).
As further shown in FIG. 48B, as the third object 4808c moves away from the fourth object 4810, the scene-based image editing system 106 exposes a portion (i.e., the rear foot) of the third object 4808c that was previously occluded by the fourth object 4810. In other words, the scene-based image editing system 106 de-occludes the third object 4808c by presenting that portion for display. In one or more embodiments, the scene-based image editing system 106 de-occludes the third object 4808c by generating the previously unseen portion. Indeed, because the digital image 4806 was initially captured with the rear foot of the third object 4808c blocked by the fourth object 4810, the scene-based image editing system 106 did not capture data corresponding to the rear foot. Thus, in some cases, the scene-based image editing system 106 generates content (e.g., a fill modification) to use as the rear foot of the third object 4808c. In one or more embodiments, the scene-based image editing system 106 generates the content utilizing a semantic map model, a generative semantic machine learning model, and/or a generative image machine learning model, as described in U.S. application Ser. No. 18/190,513. Thus, when one or more objects are moved to create and/or remove overlapping regions, the scene-based image editing system 106 occludes and/or de-occludes objects within the digital image.
As shown in fig. 48C, the scene-based image editing system 106 moves the third object 4808C farther to create an overlap region between the first object 4808a and the third object 4808C. As further shown, the scene-based image editing system 106 uses the first object 4808a within the overlap region to obscure the third object 4808c. Specifically, the scene-based image editing system 106 determines that the object depth of the third object 4808c is greater than the object depth of the first object 4808a and accordingly occludes the third object 4808c within the overlapping region. Notably, the scene-based image editing system 106 obscures the third object 4808c while the third object 4808c is still selected. Thus, in such embodiments, the scene-based image editing system 106 facilitates real-time viewing of the resulting occlusion before movement of the third object 4808c has been completed.
In some cases, when third object 4808c begins to overlap with first object 4808a, scene-based image editing system 106 obscures third object 4808c with first object 4808 a. For example, upon determining that a portion of third object 4808c overlaps a portion of first object 4808a, scene-based image editing system 106 obscures that portion of third object 4808c. As the third object 4808c moves further such that a larger portion of the third object 4808c overlaps a larger portion of the first object 4808a (i.e., the overlap area increases), the scene-based image editing system 106 updates the occlusion to occlude the larger portion of the third object 4808c.
More generally, in one or more embodiments, the scene-based image editing system 106 dynamically modifies the digital image in real-time to depict occlusion based on the current positioning of overlapping objects. Thus, when an occluding object (e.g., an object having a relatively small object depth) begins to overlap with another object (e.g., an object having a relatively large object depth), the scene-based image editing system 106 occludes those pixels of the other object that overlap with the occluding object. As the overlap area increases such that additional pixels of another object overlap with the occluding object, the scene-based image editing system 106 modifies the digital image such that additional pixels of another object are also occluded. Conversely, when the occluding object is moved such that the overlap region decreases and pixels of another object leave the overlap region, the scene-based image editing system 106 modifies the digital image to expose those pixels of the other object that have left the overlap region.
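One hypothetical way to realize the real-time behavior described above is a painter's-algorithm re-render on every drag event, as in the following Python sketch; the scene object and its methods (set_offset, depth_ranking, paint_object, background_fill) are assumptions for illustration and do not represent the disclosed implementation.

def on_drag_update(scene, moving_object_id, new_offset):
    # Reposition the moving object and re-render the frame so the occlusions
    # reflect the current overlap regions.
    scene.set_offset(moving_object_id, new_offset)
    order = scene.depth_ranking()            # nearest ... farthest object identifiers
    frame = scene.background_fill.copy()     # pre-completed background layer
    for object_id in reversed(order):        # paint the farthest objects first
        frame = scene.paint_object(frame, object_id)
    # Nearer objects are painted last, so their pixels overwrite (occlude) any
    # overlapping pixels of farther objects; pixels that leave an overlap region
    # are automatically re-exposed on the next render.
    return frame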
Similarly, in one or more embodiments, as the overlapping object portions change, the scene-based image editing system 106 modifies the digital image to update occlusions between objects. To illustrate, where the occluding object overlaps another object to create an initial overlap region, the scene-based image editing system 106 occludes the other object with the occluding object within the overlap region. Furthermore, in the event that either object has been moved to create a subsequent overlap region, scene-based image editing system 106 again modifies the digital image such that the other object is obscured by the occluding object within the subsequent overlap region. Thus, in some cases, the scene-based image editing system 106 operates in real-time to depict occlusions caused by the current positioning of objects.
As further shown in fig. 48C, upon moving the third object 4808C away from the second object 4808b (e.g., outside of the overlap region), the scene-based image editing system 106 exposes a portion of the second object 4808b that was previously occluded by the third object 4808C. In addition, as shown in fig. 48B-48C, the scene-based image editing system 106 pre-generates content fills for the third object 4808C and exposes the content fills when the third object 4808C is moved.
By performing depth-aware object movement operations as described above, the scene-based image editing system 106 provides improved flexibility and efficiency over conventional systems. For example, the scene-based image editing system 106 flexibly accommodates the positioning of objects in a three-dimensional environment captured by a digital image by incorporating object depths of objects depicted in the digital image to occlude at least one object. In effect, the scene-based image editing system 106 facilitates movement of objects within the digital image while maintaining consistency of the three-dimensional environment. Furthermore, by automatically occluding an object as it or another object moves within the digital image, the scene-based image editing system 106 eliminates the user interaction typically required to provide occlusion in many conventional systems. In effect, the scene-based image editing system 106 provides a more efficient graphical user interface by providing appropriate occlusion in response to user input for moving objects.
In one or more embodiments, the scene-based image editing system 106 occludes and/or de-occludes objects of the digital image when moving objects using properties of the digital image other than object depth. For example, in some cases, the scene-based image editing system 106 determines a hierarchy, ranking, or order of the objects within the digital image without using object depths. For example, in some cases, the scene-based image editing system 106 determines the order of the objects based on received user input regarding the digital image. In some implementations, a user of the client device specifies the order directly. In particular, in some cases, the scene-based image editing system 106 implements an algorithm for positioning the objects of the digital image within the ordering (e.g., left to right or right to left, using the sizes of the objects, or by identifying occlusions that already exist between the objects).
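The following Python sketch illustrates, purely as an assumption, one such ordering heuristic that ranks objects without depth values, for example from left to right with larger objects treated as nearer in the case of ties; the dictionary keys are hypothetical.

def infer_object_order(objects):
    # objects: list of dicts with hypothetical keys "bbox_left" (horizontal
    # position of the bounding box) and "area" (object size in pixels).
    # Order left to right; break ties by treating larger objects as nearer.
    return sorted(objects, key=lambda obj: (obj["bbox_left"], -obj["area"]))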
In some implementations, the scene-based image editing system 106 implements one or more additional features to facilitate modification of the digital image. In some embodiments, these features provide additional user interface-based efficiencies in that they reduce the amount of user interaction with the user interface that is typically required to perform certain actions in the context of image editing. In some cases, these features also facilitate deployment of scene-based image editing systems 106 on computing devices with limited screen space, as they effectively use available space to facilitate image modification without crowding the display with unnecessary visual elements.
As described above, in some implementations, the scene-based image editing system 106 utilizes a machine learning model to generate and/or extend semantic guides to modify or complete digital images. Given a digital portrait or photograph, a client device often seeks to remove unwanted attachments, remove unwanted occlusions, or undo undesired cropping of legs/feet. The scene-based image editing system 106 provides a reliable solution for efficient, accurate, and flexible portrait photo editing. Indeed, the scene-based image editing system 106 utilizes semantic map guidance in the machine learning model to remove attachments or occlusions, uncrop digital images, and control content creation. For example, the scene-based image editing system 106 may utilize a diffusion model (or other generative machine learning model) to complete the semantic map and progressively generate the image layout. Then, using the map as a guide, the scene-based image editing system may generate a realistic RGB image.
The scene-based image editing system 106 may perform various editing processes. For example, in some implementations, the scene-based image editing system utilizes semantic map guidance and machine learning models to remove attachments. In particular, the scene-based image editing system may crop an occlusion or attachment out of a digital image, generate a semantic map of the missing region using a first generative machine learning model, and then generate a completed digital image from the semantic map using a second generative machine learning model (removing the attachment or occlusion from the final digital image).
The scene-based image editing system 106 may also crop or expand the digital image. To illustrate, the scene-based image editing system may expand the size or extent of a digital image, generate a semantic map of the expanded region using a first generative machine learning model, and then generate a completed digital image using a second generative machine learning model. In this way, the scene-based image editing system may extend a close-up digital image of a person to include a more complete picture of the target object (e.g., the person's legs, feet, or arms, and the corresponding background).
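For illustration only, the two-stage flow described above might be orchestrated as in the following Python sketch, in which the three model objects are hypothetical callables standing in for the semantic graph model, the first generative (semantic) machine learning model, and the second generative (image) machine learning model.

def complete_digital_image(digital_image, fill_region,
                           semantic_model,
                           generative_semantic_model,
                           generative_image_model):
    # Stage 0: derive an initial semantic map from the digital image.
    semantic_map = semantic_model(digital_image)
    # Stage 1: complete the semantic map over the region to be filled
    # (e.g., an expanded frame or a removed attachment/occlusion).
    filled_semantic_map = generative_semantic_model(semantic_map, fill_region)
    # Stage 2: generate the completed RGB image guided by the filled semantic map.
    modified_image = generative_image_model(digital_image, filled_semantic_map)
    return modified_image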
In one or more embodiments, the scene-based image editing system 106 performs progressive semantic completion and progressive image completion. For example, the scene-based image editing system 106 may iteratively utilize the diffusion model to generate progressively more accurate semantic maps and/or completed digital images. In this way, the scene-based image editing system iteratively improves the accuracy of the resulting map and image to generate realistic results.
Furthermore, the scene-based image editing system 106 may also provide more flexibility and controllability during editing. For example, the scene-based image editing system 106 may generate different pools of possible solutions for user selection. In addition, scene-based image editing system 106 may allow a client device to provide user input of strokes, style guides, or color patches in a desired region to direct image generation. In effect, by including style input within the region of the digital image, the scene-based image editing system 106 may utilize a machine learning model to extend/apply style to the entire semantic region when generating the modified digital image. Similarly, by taking into account modifications in user input/semantic regions, the scene-based image editing system 106 can flexibly guide the generation of complete digital images to accurately reflect desired features.
For example, FIG. 49 illustrates an overview of the scene-based image editing system 106 completing digital images in accordance with one or more embodiments. For example, the scene-based image editing system 106 generates a modified digital image 4912 from a digital image 4902 by identifying objects or regions to replace or remove and/or receiving a selection of regions to remove or expand (e.g., uncrop).
In addition, scene-based image editing system 106 provides the option of generating modified digital image 4912 within a user interface. For example, the scene-based image editing system 106 provides more flexibility and controllability to client devices during editing. The user interface has various options indicating the type of modification (e.g., expanding a digital image or removing an object), and the scene-based image editing system 106 also provides customizable options for modifying the digital image 4902. For example, the scene-based image editing system 106 may generate different pools of possible solutions for user selection. Further, scene-based image editing system 106 may allow a client device to provide user input of strokes, style guides, or color patches at desired areas to guide image generation. In effect, by including style input within the region of the digital image 4902, the scene-based image editing system 106 may utilize a machine learning model to extend/apply style to the entire semantic region when generating the modified digital image 4912. Similarly, by taking into account modifications in user input/semantic regions, the scene-based image editing system 106 can flexibly guide the generation of complete digital images to accurately reflect desired features. Further details regarding these features are given below beginning with the description of FIG. 54A.
As described above, scene-based image editing system 106 generates modified digital image 4912 from digital image 4902. As shown, FIG. 49 illustrates a scene-based image editing system 106 that receives a digital image 4902. The scene-based image editing system 106 may receive the digital image 4902 from a digital image repository (e.g., a cloud repository of digital images or a local repository of digital images, such as camera scrolling). Similarly, the image editing system 104 may receive the digital image 4902 by capturing the digital image 4902 with a camera device. For example, digital image 4902 includes a digital frame (or boundary) that contains various pictorial elements defining digital image 4902.
As shown in FIG. 49, the scene-based image editing system 106 also performs an act 4904 of determining a fill modification. For example, a fill modification includes adding pixel values to a region or replacing pixel values in a region depicted within the digital image 4902. For example, to add pixel values to a region, the scene-based image editing system 106 adds pixel values to an expanded portion of the digital image (e.g., outside of the original digital frame of the digital image 4902). Further, upon the initial expansion of the frame of the digital image 4902, the expanded portion is not yet filled. In other examples, to replace pixel values, the scene-based image editing system 106 replaces existing pixel values in the digital image 4902 with new pixel values (e.g., within the original digital frame of the digital image 4902). In other words, the fill modification modifies existing pixels within the digital image or adds additional pixel values to an expanded portion of the digital image frame of the digital image 4902. Although not shown, the scene-based image editing system 106 may receive various user inputs, such as a masking user input identifying an object or region to replace, a selection of an object to remove, a region to expand, or a style/color patch to apply within an input region.
As described above, the scene-based image editing system 106 replaces or adds pixel values in a region. For example, the region indicated for the fill modification includes a portion of the existing digital image 4902 or of an expanded version of the digital image. In some cases, the region includes one or more objects within the digital image 4902. In other cases (as already mentioned), the region corresponds to an expanded frame of the digital image, i.e., the portion added by expanding the frame. As shown in FIG. 49, the scene-based image editing system 106 performs the act 4904 of determining the fill modification. In doing so, the scene-based image editing system 106 determines the fill-modification region as the expanded digital frame of the digital image 4902.
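As a hypothetical illustration of determining a fill modification for an expanded frame, the following Python sketch pads the digital frame downward and returns both the expanded image and a mask marking the unfilled region; the NumPy layout and names are assumptions and not the disclosed implementation.

import numpy as np

def expand_digital_frame(image, pad_bottom):
    # image:      H x W x C array of the original digital image.
    # pad_bottom: number of pixel rows added below the original frame.
    h, w, c = image.shape
    expanded = np.zeros((h + pad_bottom, w, c), dtype=image.dtype)
    expanded[:h] = image                           # original pixels keep their values
    fill_mask = np.zeros((h + pad_bottom, w), dtype=bool)
    fill_mask[h:] = True                           # expanded region has no pixel values yet
    return expanded, fill_mask                     # the mask marks the region to be filled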
Further, FIG. 49 shows that the scene-based image editing system 106 generates a semantic graph 4907 from the digital image 4902 using a semantic graph model 4905. For example, the semantic graph model 4905 includes a segmentation image neural network. The scene-based image editing system 106 may utilize various semantic segmentation models. For example, in some embodiments, the scene-based image editing system 106 utilizes the segmentation image neural network described by Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xingang Wang, Wenyu Liu, and Jingdong Wang, High-Resolution Representations for Labeling Pixels and Regions, arXiv:1904.04515 (2019) (or other semantic segmentation models described herein). Additional details regarding the segmentation image neural network are provided below (e.g., with respect to FIG. 53). As shown in FIG. 49, the scene-based image editing system 106 receives the digital image 4902 as input to the semantic graph model 4905 and generates the semantic graph 4907.
As described above, the scene-based image editing system 106 generates the semantic graph 4907. For example, scene-based image editing system 106 generates semantic graph 4907 (e.g., an initial semantic graph) based on the depictions within digital image 4902. Specifically, semantic graph 4907 includes pixel classifications within digital image 4902. Scene-based image editing system 106 detects classifications and objects depicted in digital image 4902 through semantic graph model 4905. For the semantic graph 4907, the scene-based image editing system 106 highlights each object of a single class with a different hue (or other visual indicator) within the semantic boundary. To illustrate, the digital image 4902 depicts a person, and the scene-based image editing system 106 generates a semantic graph by segmenting the person from the background and highlighting each sub-portion of the person. For example, the semantic graph 4907 as shown in fig. 49 highlights each sub-portion of a person, such as pants, legs, arms, shirts, faces, and hair.
As described above, the scene-based image editing system 106 detects and classifies various objects depicted in the digital image 4902. For example, an object includes a collection of pixels in the digital image 4902 that depict a person, place, or thing. To illustrate, in some embodiments, the object includes a person, an item, a natural object (e.g., a tree or rock formation) or a structure depicted in a digital image. In some cases, the object includes a plurality of elements that may be collectively distinguished from other elements depicted in the digital image 4902. For example, in some cases, the object includes a collection of buildings that make up a skyline. In some examples, the object more broadly includes (a portion of) a foreground or other element depicted in the digital image 4902 as distinct from a background. In particular, for human semantic graphs, the object includes a sub-portion of a human, such as a body portion and an article of clothing associated with the human.
As further illustrated, fig. 49 shows that scene-based image editing system 106 generates a fill semantic graph 4908 using a generative semantic machine learning model 4906. For example, as previously described, scene-based image editing system 106 performs act 4904 to determine the fill modification. Fig. 49 shows that scene-based image editing system 106 determines the fill modification as a frame of expanded digital image 4902. Thus, as shown, fig. 49 illustrates that scene-based image editing system 106 generates a filled semantic graph 4908 from semantic graph 4907 using generative semantic machine learning model 4906. Thus, the scene-based image editing system 106 expands the semantic graph 4907 via the generative semantic machine learning model 4906 to include semantic tags that add regions (corresponding to unfilled expanded frames). Because the digital image 4902 does not include any pixel values within the indicated region to be expanded, the scene-based image editing system 106 populates (e.g., repairs) semantic tags within the expanded region.
In one or more embodiments, the machine learning model includes a generative machine learning model. Specifically, a generative machine learning model includes a machine learning model that generates or creates digital image content. For example, a generative machine learning model includes a diffusion network. A generative machine learning model further includes a generative adversarial neural network (GAN). For example, in a GAN, two machine learning models compete with each other in a zero-sum game.
As described above, the scene-based image editing system 106 utilizes generative machine learning models. Specifically, as described above, the scene-based image editing system 106 utilizes the generative semantic machine learning model 4906. For example, the generative semantic machine learning model 4906 is a generative machine learning model that generates a semantic map of a digital image. Specifically, the generative semantic machine learning model generates a realistic semantic map that fills in gaps of an expanded digital image or of a semantic map of the digital image 4902. To illustrate, the generative semantic machine learning model 4906 generates the filled semantic graph 4908 from the initially generated semantic graph (e.g., the semantic graph 4907).
As described above, the scene-based image editing system 106 generates the filled semantic graph 4908, which indicates semantic classifications corresponding to the digital image 4902 with the expansion filled in. For example, similar to the discussion above regarding the semantic graph 4907, the filled semantic graph 4908 also includes the classifications and boundaries of the various objects within the digital image. However, the scene-based image editing system 106 generates the filled semantic graph 4908 based on the fill modification, which indicates the region of the expanded digital frame of the digital image 4902 to be filled. To illustrate, if the fill modification indicates that the region to be filled includes an expanded frame of the digital image 4902, the scene-based image editing system 106 generates a filled semantic graph corresponding to the indicated region. In this case, the filled semantic graph 4908 includes semantic classifications and semantic boundaries for the expanded portion of the digital image. In other examples, when the fill modification indicates that the region to be filled replaces a portion of the existing digital image, the scene-based image editing system 106 generates a filled semantic graph with a segmented portion, including the corresponding semantic classifications and semantic boundaries, for the portion to be replaced. More details about semantic boundaries and classifications are given below with respect to FIG. 54A.
Additionally, as further illustrated, fig. 49 shows a scene-based image editing system 106 that utilizes a generative image machine learning model 4910 to generate a modified digital image 4912. For example, FIG. 49 shows that scene-based image editing system 106 generates modified digital image 4912 from fill semantic map 4908. As previously described, scene-based image editing system 106 utilizes generating machine learning models to generate true fill semantic graphs. As shown, the scene-based image editing system 106 also utilizes the generative image machine learning model 4910 to generate realistic digital images from the filling semantic map 4908. In particular, the generative image machine learning model 4910 receives as input a filling semantic map 4908 and a digital image 4902. The fill semantic map 4908 and the digital image 4902 guide the generative image machine learning model 4910 to generate a modified digital image 4912.
As shown in modified digital image 4912, scene-based image editing system 106 fills the portion of the person depicted in digital image 4902 below the knee (which depicts the person only from the knee upwards). Results generated by scene-based image editing system 106 illustrate efficient, accurate, and high quality completion of digital image 4902 by completing semantic graph 4907 and generating modified digital image 4912 using fill semantic graph 4908.
Conventional systems utilize recent computational advances to modify digital images using various digital tools and models. To illustrate, conventional systems utilize computer-implemented models to crop, repair, and modify digital objects depicted in digital images. However, despite these advances, conventional systems still suffer from a number of technical drawbacks, particularly in terms of accuracy, efficiency, and flexibility in implementing the system in generating modified digital images.
For example, conventional systems are capable of cropping and repairing digital images, but these systems often produce inaccurate and impractical results. Similarly, conventional systems typically require a significant amount of user interaction and user interface. The user interaction and interfaces required to add or remove regions consume valuable computing resources and time. In addition, conventional systems provide limited functional flexibility in repairing digital images. For example, conventional systems seek to produce pixels that will mix with surrounding pixels, but fail to provide additional functional control regarding the features or factors that implement the underlying model.
As described above, the scene-based image editing system 106 may improve the accuracy of conventional systems. For example, scene-based image editing system 106 utilizes fill semantic map 4908 and digital image 4902 to create a true filled digital image (e.g., modified digital image 4912). In particular, scene-based image editing system 106 generates a digital image that accurately and realistically depicts a target object by generating a fill semantic map 4908 (using a generation machine learning model) and generating a modified digital image 4912 from the fill semantic map (using a generation machine learning model). Indeed, in some embodiments, the scene-based image editing system 106 adjusts the diffuse neural network to more accurately generate the filling semantic map and accurately derive the digital image. Thus, as shown in FIG. 49, the scene-based image editing system 106 may accurately expand digital images or perform other modifications, such as filling in or modifying object/scene textures.
In addition, in one or more implementations, scene-based image editing system 106 also improves efficiency over conventional systems. For example, the scene-based image editing system 106 may generate modified digital images with reduced user interaction, user interface, and time-consuming correction/editing associated with many conventional systems. For example, scene-based image editing system 106 generates semantic graph 4907, populates semantic graph 4908, and modified digital image 4912 using various machine learning models. Thus, the scene-based image editing system 106 saves valuable time and computing resources by intelligently generating the modified digital image 4912. In addition, scene-based image editing system 106 may provide a streamlined interface for intuitive and simplified flow to generate fill semantic map 4908 and/or modified digital image 4912. In particular, the scene-based image editing system 106 may provide a unified user interface for generating the fill semantic graph and the modified digital image to reduce or eliminate the need to alternate between multiple interfaces used by conventional systems.
In addition to the above, the scene-based image editing system 106 also improves functional flexibility. For example, the scene-based image editing system 106 allows a client device to expand the frame of a digital image, remove objects depicted within the digital image, or modify textures within portions of the digital image. Furthermore, the scene-based image editing system 106 provides improved functionality through a user interface that allows unique control over the implementing models, such as the number of filled segmentation maps to generate, the number of modified images to generate, and the number of layers to utilize (e.g., within a diffusion neural network), as well as controls for adjusting the texture generated by the model and semantic editing inputs for guiding the generation of filled semantic maps and/or modified digital images.
As described above, the scene-based image editing system 106 utilizes various types of machine learning models. For example, FIG. 50A illustrates the scene-based image editing system 106 utilizing a diffusion neural network (also referred to as a "diffusion probabilistic model" or a "denoising diffusion probabilistic model") to generate a filled semantic graph in accordance with one or more embodiments. In particular, FIG. 50A shows the diffusion neural network generating the filled semantic graph 5024, while a subsequent figure (FIG. 51) shows a diffusion neural network generating a modified digital image conditioned on the filled semantic graph 5024. For example, in one or more embodiments, the scene-based image editing system 106 utilizes a diffusion model (or diffusion neural network) as described in J. Ho, A. Jain, and P. Abbeel, Denoising Diffusion Probabilistic Models, arXiv:2006.11239, or by Jiaming Song et al. in Denoising Diffusion Implicit Models, ICLR 2021, the entire contents of which are incorporated herein by reference.
As described above, the scene-based image editing system 106 utilizes a diffusion neural network. In particular, the diffusion neural network receives a digital image as input and adds noise to the digital image through a series of steps. For example, the scene-based image editing system 106 maps the digital image to a latent space via the diffusion neural network using a fixed Markov chain that adds noise to the data of the digital image. Furthermore, each step of the fixed Markov chain depends on the previous step. Specifically, at each step, the fixed Markov chain adds Gaussian noise with a scheduled variance, producing a diffusion representation (e.g., a diffusion latent vector, a diffusion noise map, or a diffusion inversion). After each step of the diffusion neural network adds noise to the digital image, the scene-based image editing system 106 utilizes a trained denoising neural network to recover the original data of the digital image. Specifically, the scene-based image editing system 106 reverses the processing of the fixed Markov chain with a denoising neural network having a length T equal to the length of the fixed Markov chain.
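The following Python sketch illustrates the fixed Markov chain described above under the assumption of a standard linear variance schedule; the step count, schedule, and array handling are assumptions for illustration and do not represent the patented implementation.

import numpy as np

def forward_diffusion(x0, num_steps=50, beta_start=1e-4, beta_end=0.02):
    # x0: the initial representation (e.g., a latent vector or pixel array).
    # Each step depends only on the previous step and adds Gaussian noise
    # whose variance follows the schedule, producing one diffusion
    # representation per step.
    betas = np.linspace(beta_start, beta_end, num_steps)
    representations = []
    x = x0.astype(np.float32)
    for beta in betas:
        noise = np.random.randn(*x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        representations.append(x.copy())
    return representations  # the final entry approaches random Gaussian noise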
As previously described, in one or more embodiments, the scene-based image editing system 106 generates a filled semantic graph. FIG. 50A illustrates the scene-based image editing system 106 training the diffusion neural network to generate the filled semantic graph 5024. In particular, FIG. 50A illustrates the scene-based image editing system 106 analyzing an input filled semantic graph 5002 to generate the filled semantic graph 5024 (e.g., a reconstruction of the input filled semantic graph 5002). In particular, the scene-based image editing system 106 utilizes a diffusion process during training to generate various diffusion representations and passes the final diffusion representation to the denoising neural network as input. During training, the scene-based image editing system 106 supervises the output of each denoising neural network layer based on the diffusion representations generated during the diffusion process.
As shown, FIG. 50A illustrates the scene-based image editing system 106 generating a latent vector 5006 from the input filled semantic graph 5002 using an encoder 5004. In one or more embodiments, the encoder 5004 is a neural network (or one or more layers of a neural network) that extracts features related to the input filled semantic graph 5002, e.g., in this case, related to the objects (human sub-parts) depicted within the input filled semantic graph 5002. In some cases, the encoder 5004 includes a neural network that encodes features from the input filled semantic graph 5002. For example, the encoder 5004 may include a particular number of layers, including one or more fully connected and/or partially connected layers, that identify and represent the characteristics/features of the input filled semantic graph 5002 through latent feature vectors. Thus, the latent vector 5006 includes a latent (e.g., non-human-interpretable) vector representation of the input filled semantic graph 5002. In particular, the latent vector 5006 includes a numerical representation of the features of the input filled semantic graph 5002.
Further, FIG. 50A shows a diffusion process 5008 of the diffusion neural network. Specifically, FIG. 50A shows the diffusion of the latent vector 5006. At each step of the diffusion process 5008 (based on the fixed Markov chain), the scene-based image editing system 106 generates a diffusion representation via the diffusion neural network. For example, the diffusion process 5008 adds noise to the diffusion representation at each step until the diffusion representation is diffused, corrupted, or replaced. In particular, the scene-based image editing system 106 utilizes the fixed Markov chain to add Gaussian noise to the signal of the latent vector 5006 via the diffusion process 5008. The scene-based image editing system 106 may adjust the number of diffusion steps (and the corresponding number of denoising layers in the denoising step) in the diffusion process 5008. Furthermore, while FIG. 50A illustrates the diffusion process 5008 being performed on the latent vector 5006, in some embodiments the scene-based image editing system 106 applies the diffusion process 5008 to the pixels of the input filled semantic graph 5002 (without generating a latent vector representation of the input filled semantic graph 5002).
As just described, the diffusion process 5008 adds noise at each step of the diffusion process 5008. In effect, at each diffusion step, the diffusion process 5008 adds noise and produces a diffusion representation. Thus, for a diffusion process 5008 having five diffusion steps, the diffusion process 5008 produces five diffusion representations. As shown in FIG. 50A, the scene-based image editing system 106 generates a final diffusion representation 5010. Specifically, as shown in FIG. 50A, the final diffusion representation 5010 includes random Gaussian noise after the diffusion process is completed. As part of the diffusion neural network, a denoising neural network 5012a denoises the final diffusion representation 5010 (e.g., reverses the process performed by the diffusion process 5008 to add noise to the diffusion representation).
As shown, FIG. 50A illustrates the denoising neural network 5012a partially denoising the final diffusion representation 5010 by generating a first denoised representation 5014. Further, FIG. 50A also shows a denoising neural network 5012b that receives the first denoised representation 5014 for further denoising to generate a second denoised representation 5016. In particular, in one or more embodiments, the number of denoising steps corresponds to the number of diffusion steps (e.g., the number of diffusion steps of the fixed Markov chain). Further, FIG. 50A illustrates the scene-based image editing system 106 processing the second denoised representation 5016 with a decoder 5018 to generate the filled semantic graph 5024.
In one or more implementations, the scene-based image editing system 106 trains the denoising neural network in a supervised manner based on the diffusion representation generated at the diffusion process 5008. For example, the scene-based image editing system 106 compares the diffusion representation at the first step of the diffusion process 5008 with the final de-noised representation generated by the final de-noised neural network (with a loss function). Similarly, the scene-based image editing system 106 may compare (with the loss function) the second diffusion representation from the second step of the diffusion process 5008 with the penultimate denoising representation generated by the penultimate denoising neural network. Thus, the scene-based image editing system 106 can teach or train the denoising neural network to denoise random gaussian noise and generate realistic digital images using the corresponding diffusion representations of the diffusion process 5008.
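A minimal sketch of the per-step supervision described above, written in Python and assuming the diffusion representations and denoised representations are stored as aligned NumPy arrays, is shown below; the L2 comparison and mirrored pairing are assumptions for illustration rather than the claimed training procedure.

import numpy as np

def per_step_supervision_loss(diffusion_representations, denoised_representations):
    # Pair the first diffusion step with the final denoised representation,
    # the second diffusion step with the penultimate denoised representation,
    # and so on, comparing each pair with a mean-squared error.
    num_steps = len(diffusion_representations)
    total = 0.0
    for i in range(num_steps):
        target = diffusion_representations[i]
        prediction = denoised_representations[num_steps - 1 - i]
        total += float(np.mean((prediction - target) ** 2))
    return total / num_steps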
The scene-based image editing system 106 may use various neural network formulations for the denoising neural networks 5012a, 5012b. For example, in some implementations, the scene-based image editing system 106 utilizes a time-conditioned U-Net, such as described by O. Ronneberger, P. Fischer, and T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI (3), vol. 9351 of Lecture Notes in Computer Science, pp. 234-250 (2015), which is incorporated herein by reference in its entirety. In one or more embodiments, the scene-based image editing system 106 utilizes the diffusion architecture and training approach described by R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer in High-Resolution Image Synthesis with Latent Diffusion Models, arXiv:2112.10752v2, which is incorporated herein by reference in its entirety.
In addition to the training illustrated in FIG. 50A, FIG. 50B illustrates a scene-based image editing system 106 generating a filling semantic graph (e.g., during inference) in accordance with one or more embodiments. In particular, the scene-based image editing system 106 utilizes random noise input 5011 (along with the adjustments) to generate a filled semantic graph from the semantic graph 5019 (i.e., the incomplete semantic graph relative to the expanded frames of the un-cropped digital image).
As shown, the scene-based image editing system 106 adjusts the denoising neural network 5012a and the denoising neural network 5012b. For example, fig. 50B illustrates the scene-based image editing system 106 performing an act 5020 of adjusting each layer of the denoising neural network 5012a and the denoising neural network 5012b. To illustrate, adjusting a layer of a neural network includes providing context to the network to guide the generation of the fill semantic graph 5024. For example, adjusting a layer of the neural network includes (1) converting adjustment inputs (e.g., the digital image, the semantic graph 5019, and semantic editing inputs) into an adjustment representation to be combined with a denoised representation; and/or (2) utilizing an attention mechanism that focuses the neural network on a particular portion of the input and adjusts its predictions (e.g., outputs) based on the attention mechanism. In particular, for the denoising neural network, adjusting a layer of the denoising neural network includes providing alternative inputs (e.g., the digital image, the semantic graph 5019, and semantic editing inputs) to the denoising neural network. Specifically, the scene-based image editing system 106 provides the alternative inputs to guide the denoising process of removing noise from the diffusion representation. Thus, the adjusted layers of the denoising neural network act as guard rails that allow the denoising neural network to learn how to remove noise from the input signal and produce a clean output.
For example, fig. 50B shows that, prior to the adjustment, the scene-based image editing system 106 utilizes the encoder 5022 to analyze adjustment information (such as the digital image, the semantic graph 5019, and/or a semantic editing input for the digital image). In some embodiments, the encoder 5022 generates an adjustment representation from the adjustment information. In one or more embodiments, the scene-based image editing system 106 provides options to the user of the client device for indicating semantic editing input. In particular, the scene-based image editing system 106 provides the user of the client device with the semantic graph 5019 and options for indicating semantic classifications/boundaries. For example, the scene-based image editing system 106 provides options for indicating the classification and/or location of the portion of the person depicted in the semantic graph 5019 to be filled in below the knees. Furthermore, in one or more embodiments, the scene-based image editing system 106 provides additional fill semantic graphs that include additional semantic classifications within additional semantic boundaries. For example, the scene-based image editing system 106 allows a user of a client device to indicate a plurality of semantic classifications within a plurality of semantic boundaries.
Further, in response to the user of the client device providing the indication, the scene-based image editing system 106 processes the semantic editing input for adjustment. Thus, the scene-based image editing system 106 may adjust the network based on the digital image and/or a combination of one or more of the semantic graph 5019 and the semantic editing input for the digital image. More details regarding the scene-based image editing system 106 providing the user of the client device with options for indicating semantic editing input are presented below.
Further, the scene-based image editing system 106 utilizes the adjustment representation generated by the encoder 5022 to perform the act 5020 of adjusting the layers of the denoising neural network 5012a and the denoising neural network 5012b. Specifically, adjusting the layers of the network includes modifying the inputs to the layers of the denoising neural network by combining them with the adjustment representation. For example, the scene-based image editing system 106 combines (e.g., concatenates) the adjustment representation generated by the encoder 5022 at different layers of the denoising neural network. For example, the scene-based image editing system 106 combines one or more adjustment vectors with the random noise input 5011, the first denoised representation 5014, or other intermediate denoised representations analyzed by the denoising layers. Thus, the denoising process considers the semantic graph 5019 and the digital image to generate a denoised representation.
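As a rough sketch of the combination just described, the adjustment representation can be concatenated with the noisy (or intermediate denoised) representation before a denoising layer processes it. The channel counts and module names below are illustrative assumptions, not the actual architecture.

```python
import torch
import torch.nn as nn

class ConditionedDenoisingLayer(nn.Module):
    """Denoising layer whose input is concatenated with an adjustment (conditioning) representation."""
    def __init__(self, noise_channels, cond_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(noise_channels + cond_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, noise_channels, 3, padding=1),
        )
    def forward(self, noisy_rep, cond_rep):
        # combine the denoising representation with the adjustment representation
        return self.block(torch.cat([noisy_rep, cond_rep], dim=1))

# cond_rep stands in for the encoder output computed from the digital image,
# the semantic graph 5019, and any semantic editing input (all assumed here).
noisy_rep = torch.randn(1, 4, 32, 32)   # random noise input / intermediate denoised representation
cond_rep = torch.randn(1, 8, 32, 32)    # adjustment representation from the encoder
layer = ConditionedDenoisingLayer(noise_channels=4, cond_channels=8)
denoised = layer(noisy_rep, cond_rep)
```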
As shown, fig. 50B illustrates the scene-based image editing system 106 utilizing the second denoised representation 5016 (e.g., the final denoised representation illustrated in fig. 50B, although other embodiments may include additional steps and additional denoised representations). For example, the scene-based image editing system 106 utilizes the decoder 5018 to process the second denoised representation 5016. Specifically, the scene-based image editing system 106 generates the fill semantic graph 5025 from the second denoised representation 5016 via the decoder 5018. As shown, fig. 50B illustrates that the fill semantic graph 5025 includes semantic labels for the person from the knees down, whereas the semantic graph 5019 includes semantic labels for the person only from the knees up. Thus, the scene-based image editing system 106 utilizes the diffusion neural network to generate the fill semantic graph 5025.
Although fig. 50B illustrates generating the fill semantic graph 5025 with a denoising architecture, in one or more implementations, the scene-based image editing system 106 utilizes cross-attention layers that analyze the representation from the encoder 5022 together with the denoised representations (i.e., the intermediate representations of the U-Net).
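Where cross-attention is used instead of (or in addition to) concatenation, the intermediate U-Net representation attends to the encoder representation. The sketch below uses PyTorch's built-in multi-head attention as a stand-in; the token counts and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class CrossAttentionConditioning(nn.Module):
    """Intermediate denoised features (queries) attend to encoder features (keys/values)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)
    def forward(self, denoised_tokens, cond_tokens):
        attended, _ = self.attn(query=denoised_tokens, key=cond_tokens, value=cond_tokens)
        return denoised_tokens + attended  # residual combination with the attended context

denoised_tokens = torch.randn(1, 256, 64)  # flattened intermediate U-Net representation
cond_tokens = torch.randn(1, 77, 64)       # representation from the conditioning encoder
out = CrossAttentionConditioning(dim=64)(denoised_tokens, cond_tokens)
```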
As described above, fig. 51 illustrates the scene-based image editing system 106 utilizing a denoising neural network to generate a modified digital image 5124 conditioned on a fill semantic map 5121 (e.g., a completed semantic map). Similar to the discussion above, the scene-based image editing system 106 utilizes a diffusion neural network for training purposes to generate the modified digital image 5124. In particular, during training, the scene-based image editing system 106 utilizes an encoder to analyze the input digital image (rather than the semantic graph) and generate feature vectors. Furthermore, the scene-based image editing system 106 utilizes the diffusion process to process the feature vectors and generate a diffusion representation at each step (depending on the length of the fixed Markov chain). Further, the scene-based image editing system 106 generates a final diffusion representation of the input digital image (e.g., the expected output during training). The scene-based image editing system 106 trains the denoising neural network by comparing the diffusion representations generated by the diffusion process with corresponding denoised representations generated by the denoising neural network layers.
As shown, FIG. 51 illustrates the scene-based image editing system 106 utilizing a trained diffusion network to generate a complete digital image. In particular, fig. 51 illustrates the scene-based image editing system 106 utilizing the fill semantic map 5121 and a digital image (e.g., the digital image 4002 discussed in fig. 49) as adjustment inputs to the denoising neural network. Furthermore, the scene-based image editing system 106 utilizes a binary mask to indicate the region to be filled to the denoising neural network. The scene-based image editing system 106 may also utilize additional editing input together with the fill semantic graph 5121. In particular, the scene-based image editing system 106 provides options to a user of the client device to provide color or texture patches as adjustment inputs. Based on the provided color or texture patches, the scene-based image editing system 106 then adjusts each layer of the denoising neural network using the fill semantic map 5121 and the digital image with the color or texture patches as adjustment inputs.
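A rough sketch of assembling the adjustment inputs mentioned above (the digital image, a binary mask marking the region to fill, the fill semantic map, and an optional color/texture patch) into one tensor for the conditioning encoder follows; the channel layout and shapes are assumptions for illustration.

```python
import torch

digital_image = torch.rand(1, 3, 128, 128)       # RGB input image
binary_mask = torch.zeros(1, 1, 128, 128)        # 1 where pixels must be filled
binary_mask[:, :, 96:, :] = 1.0                  # e.g., an expanded frame below the knees
fill_semantic_map = torch.rand(1, 1, 128, 128)   # semantic labels encoded as a single channel
texture_patch = torch.rand(1, 3, 128, 128)       # optional color/texture patch tiled to image size

# stacked adjustment input passed to the conditioning encoder before each denoising layer
adjustment_input = torch.cat(
    [digital_image, binary_mask, fill_semantic_map, texture_patch], dim=1
)
```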
As further shown, FIG. 51 illustrates the scene-based image editing system 106 utilizing a random noise input 5110 with a denoising neural network 5112a (e.g., a first denoising layer). Similar to the discussion of fig. 41B, the scene-based image editing system 106 utilizes the denoising neural network 5112a to reverse the diffusion process and generate the modified digital image 5124. Fig. 51 also shows the denoising neural network 5112a generating a denoised representation 5114, which is further denoised with the denoising neural network 5112b (the number of denoising steps here also corresponds to the number of steps in the diffusion process).
Further, FIG. 51 shows the scene-based image editing system 106 performing an act 5120. For example, act 5120 includes adjusting the layers of the denoising neural network 5112a and the denoising neural network 5112b. In particular, FIG. 51 illustrates the scene-based image editing system 106 performing the act 5120 of making the adjustment using the fill semantic graph 5121. FIG. 51 shows an encoder 5122 analyzing the fill semantic map 5121 (generated in FIG. 50B), the digital image, and the binary mask. By adjusting the layers of the denoising neural network 5112a and the denoising neural network 5112b with the fill semantic map 5121 and the digital image, the scene-based image editing system 106 accurately generates the modified digital image 5124.
As described above, the scene-based image editing system 106 generates the modified digital image 5124 by adjusting the layers of the denoising neural network. In particular, fig. 51 shows that the scene-based image editing system 106 processes the second denoised representation 5116 with the decoder 5118 to generate the modified digital image 5124. Specifically, the modified digital image 5124 accurately depicts the person's legs from the knees down. Thus, fig. 51 illustrates that the scene-based image editing system 106 generates, via the diffusion neural network, a filled digital image corresponding to the expanded frame of the digital image 4002 illustrated in fig. 49.
Also as described above, in one or more implementations, the scene-based image editing system 106 generates a modified digital image using an input texture. For example, FIG. 52 illustrates utilizing an input texture to generate a modified digital image with a diffusion neural network in accordance with one or more embodiments. Similar to fig. 50A and 51, the scene-based image editing system 106 trains the diffusion neural network. In particular, similar to the discussion above, during training, the scene-based image editing system 106 utilizes the expected output as input to the diffusion neural network and utilizes the diffusion representations to supervise training of the denoising layers of the neural network.
The scene-based image editing system 106 receives the digital image 5215, the input texture 5211, and an indication of the pixel values within the digital image 5215 to replace, and processes them with the trained denoising neural network. In particular, the scene-based image editing system 106 replaces the indicated pixel values with the input texture 5211. Specifically, the input texture 5211 includes a sample texture that depicts a pattern selected by a user of the client device. To illustrate, the scene-based image editing system 106 modifies the input digital image 5215 with the selected pattern by localizing the input texture 5211 to a relevant region of the input digital image 5215. For example, the scene-based image editing system 106 receives the input texture 5211 from a client device or from a selection of predefined texture options.
In one or more embodiments, the scene-based image editing system 106 utilizes the diffusion neural network to generate a modified digital image 5216 that includes the input texture 5211. Similar to the discussion above, the scene-based image editing system 106 also utilizes the diffusion neural network to replace regions in the digital image 5215 with the input texture 5211. In some implementations, the scene-based image editing system 106 isolates texture modifications to certain portions of the digital image within the diffusion neural network. In particular, fig. 52 shows that the scene-based image editing system 106 generates a mask of the relevant portion of the digital image 5215 that is replaced with the input texture 5211. For example, in fig. 52, the relevant portion includes the clothing of the person depicted in the digital image 5215. Further, the scene-based image editing system 106 generates the mask for the clothing via a segmentation image neural network based on an input or query selection by the user of the client device. This is discussed in more detail below (e.g., with respect to fig. 53). As described above, the scene-based image editing system 106 may train the diffusion neural network within the denoising process 5206 by reconstructing the input digital image and supervising the diffusion neural network with the diffusion representations generated by the steps of the diffusion process.
Fig. 52 illustrates an act 5208 of the scene-based image editing system 106 adjusting each denoising layer in the denoising process 5206 based on processing the random noise input 5213. In particular, the scene-based image editing system 106 utilizes the encoder 5209 to process the input texture 5211 and the input digital image 5215. In particular, the scene-based image editing system 106 utilizes the encoder 5209 to generate one or more adjustment vectors that are combined with the input (i.e., the denoised representation) at each denoising layer. Furthermore, although not shown in fig. 52, in one or more embodiments, the scene-based image editing system 106 also provides options for user input. In particular, the user input includes modifications to the masked portion of the digital image 5215 or modifications to the input texture 5211. For example, modifying includes changing the input texture 5211 or modifying the masked portion of the input digital image 5215. Thus, the scene-based image editing system 106 performs the act 5208 of adjusting each layer of the denoising neural network with the user input, among other adjustment inputs.
As described above, the scene-based image editing system 106 may apply the denoising and adjustment processes such that the texture is localized to certain portions or regions of the modified digital image. In fact, globally applying an adjustment vector based on the input texture 5211 throughout the denoising process could result in adding texture to the entire modified digital image. The scene-based image editing system 106 may localize the texture to a specific area (i.e., the clothing region) by applying a mask between the denoising layers and utilizing an unconditional representation of the input digital image for areas outside of the mask.
For example, fig. 52 illustrates the scene-based image editing system 106 applying a first denoising layer 5218a to generate a first denoised representation 5212a. The scene-based image editing system 106 applies the mask 5220 to the first denoised representation 5212a (i.e., isolating the area within the clothing and omitting the area outside the clothing). Thus, the scene-based image editing system 106 generates a first masked denoised representation 5222a. The first masked denoised representation 5222a therefore represents a partially denoised signal that includes the input texture 5211 for the clothing region.
The scene-based image editing system 106 then combines the first masked denoised representation 5222a with the untextured image representation 5210a. In fact, as shown, the scene-based image editing system 106 identifies or generates the untextured image representation 5210a as a representation of the digital image 5215 without the texture. The scene-based image editing system 106 may generate the untextured image representation 5210a in various ways. For example, in some implementations, the scene-based image editing system 106 utilizes the output of the diffusion layer at the same respective denoising level as the untextured image representation 5210a. In some implementations, the scene-based image editing system 106 applies the denoising layer without a condition vector to generate the untextured image representation 5210a.
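A minimal sketch of the localization step just described follows: the texture-conditioned denoised representation is kept inside the mask (the clothing region), and the unconditional representation is kept everywhere else. The tensors and shapes are illustrative placeholders.

```python
import torch

def localize_texture(textured_rep, untextured_rep, mask):
    """Keep the texture-conditioned signal inside the mask and the
    unconditional (untextured) representation outside the mask."""
    return mask * textured_rep + (1.0 - mask) * untextured_rep

textured_rep = torch.randn(1, 4, 64, 64)    # e.g., a masked, texture-conditioned denoised representation
untextured_rep = torch.randn(1, 4, 64, 64)  # untextured image representation at the same denoising level
mask = torch.zeros(1, 1, 64, 64)
mask[:, :, 16:48, 16:48] = 1.0              # clothing region (placeholder)
combined = localize_texture(textured_rep, untextured_rep, mask)
```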
As described above, in one or more embodiments, the scene-based image editing system 106 utilizes a segmented image neural network. For example, FIG. 53 illustrates a scene based image editing system 106 that generates modified digital images based on segmentation of an input image in accordance with one or more embodiments. Specifically, fig. 53 shows a digital image 5300 depicting a person holding two items, one in each hand. The scene-based image editing system 106 generates a modified digital image 5318 that removes both items from the person's hand and fills the removed portion of the digital image 5300.
As shown, FIG. 53 illustrates a segmentation of two items in a human hand by the scene-based image editing system 106. For example, fig. 53 shows that the scene-based image editing system 106 utilizes the user selection 5302 or the segmentation machine learning model 5304 to select and segment two items in a human hand. In particular, the segmentation machine learning model 5304 is a neural network that segments or selects objects in a digital image. Thus, as shown, the segmentation machine learning model 5304 generates a segmented (or masked) digital image 5306.
The scene-based image editing system 106 may identify the user selection 5302 in various ways. In some implementations, the scene-based image editing system 106 provides editing tools to the client device and identifies areas to remove based on user interactions with the editing tools. In particular, the client device may identify a user interaction (e.g., a click, finger press, or drawing) to indicate the portion of the scene-based image editing system 106 to segment.
In some embodiments, the scene-based image editing system 106 utilizes the segmentation machine learning model 5304 to select items. The segmentation machine learning model 5304 may also segment or select objects based on user input (or without user input). For example, in some implementations, the scene-based image editing system 106 provides the user of the client device with the option to query (via selection or natural language input) the portion to be selected. For example, a user of the client device queries the scene-based image editing system 106 to select the "package" shown in the digital image 5300. Based on the query, the scene-based image editing system 106 utilizes the segmentation machine learning model 5304 to segment the portion of the digital image 5300 depicting the "package". In other embodiments, the user of the client device queries the scene-based image editing system 106 to select "pants" or "shirts" or "hair".
In some implementations, the scene-based image editing system 106 utilizes the user selections 5302 and the segmentation machine learning model 5304 to segment or select objects in the digital image. For example, the scene-based image editing system 106 can utilize the segmentation machine learning model 5304 to analyze user input (e.g., clicks, drawings, contours, or natural language) of the digital image to generate the segmentation. For example, the scene-based image editing system 106 may utilize the segmentation method described in U.S. application Ser. No. 16/376,704, "UTILIZING INTERACTIVE DEEP LEARNING TO SELECT OBJECTS IN DIGITAL VISUAL MEDIA," filed on 5/4/2019, the entire contents of which are incorporated herein by reference.
As previously described, scene-based image editing system 106 may also generate semantic graphs. With respect to fig. 53, scene-based image editing system 106 generates semantic graph 5310 using segmented (or masked) digital image 5306. In particular, scene-based image editing system 106 utilizes segmentation machine learning model 5304 to assign labels to pixels within digital image 5300.
As further illustrated, fig. 53 shows the scene-based image editing system 106 generating a semantic graph 5310 from the segmented (or masked) digital image 5306 using a semantic graph model 5308 (already discussed above). In particular, the scene-based image editing system 106 generates the semantic graph 5310 for the masked portion (i.e., the fill region) of the segmented digital image 5306. According to the embodiment shown in FIG. 53, the scene-based image editing system 106 also performs an act 5312. In particular, act 5312 includes providing a user of the client device with an option to provide user input to direct generation of the fill semantic map 5314. In particular, the option to provide user input directing generation of the fill semantic graph 5314 from the semantic graph includes indicating semantic classifications/boundaries in the semantic graph 5310. For example, the scene-based image editing system 106 receives user input directing the semantic graph and, with the generative semantic machine learning model 5313, generates the fill semantic graph 5314. Specifically, as described above, the scene-based image editing system 106 generates the fill semantic graph 5314 by using the semantic graph and/or the user input directing the semantic graph (e.g., the semantic editing input) as adjustment inputs. For example, as discussed, the scene-based image editing system 106 utilizes the semantic graph and/or the user input directing the semantic graph to adjust each layer of the denoising neural network.
Thus, in providing the option to provide user input to guide the semantic graph, the scene-based image editing system 106 provides options for indicating the direction and position of the hand beneath the removed item (e.g., the package). This is discussed in further detail below.
In response to the scene-based image editing system 106 performing the act 5312 of providing an option to provide user input guiding generation of the fill semantic graph 5314, the scene-based image editing system 106 also utilizes the generative semantic machine learning model 5313 to generate the fill semantic graph 5314 based on the user input. As shown in fig. 53, act 5312 may occur iteratively to regenerate the fill semantic map 5314 until the user of the client device is satisfied with the fill semantic map 5314.
Further, fig. 53 illustrates a scene-based image editing system 106 that utilizes a generative image machine learning model 5316 (e.g., the generative image machine learning model 4910 discussed in fig. 49) to generate a modified digital image 5318. In particular, modified digital image 5318 shows digital image 5300, but without items in the person's hand, and the background and hand are populated by scene-based image editing system 106.
While fig. 53 illustrates the act 5312 of providing an option to a user of a client device to provide user input directing the scene-based image editing system 106 to generate the fill semantic graph 5314, in one or more embodiments, the scene-based image editing system 106 does not provide the user input option. For example, the scene-based image editing system 106 generates the semantic graph 5310 and then generates the modified digital image 5318 via the generative image machine learning model 5316.
As described above, in some embodiments, the scene-based image editing system 106 provides a user interface to a user of a client device to generate a modified digital image. For example, fig. 54A illustrates a client device 5410 having a user interface 5408 for expanding the frame of a digital image (e.g., un-cropping) in accordance with one or more embodiments. In particular, fig. 54A shows a selectable expansion option 5400 for "how to un-crop an image" and a drop-down option indicating the manner in which to expand the digital image 5402. For example, expanding the digital image 5402 includes the fill modifications described above.
Further, the selectable expansion options 5400 include options indicating a percentage of the expanded digital image 5402 and a location of the digital image 5402 within the expanded frame 5404. For example, fig. 54A shows an instruction to expand the digital image 5402 by 150%. Further, fig. 54A also shows an instruction to position the digital image 5402 at the top center position of the extension frame 5404. In other examples, the locations of the digital images in the expanded frame 5404 include top left, top right, bottom left, bottom right, bottom center, middle left, middle right, and middle center. Further, the scene-based image editing system 106 can place the digital image 5402 within the extension frame 5404 and provide the user of the client device with the option to drag the digital image 5402 to a desired location.
In response to an indication of the parameters for expanding the digital image 5402, the scene-based image editing system 106 shows the semantic graph 5406 alongside the digital image 5402. Similar to the discussion above, the scene-based image editing system 106 generates the semantic graph 5406 using a semantic graph model. In addition, FIG. 54A also shows the semantic graph within the "top-center" expanded frame 5404. In addition, FIG. 54A shows that the scene-based image editing system 106 provides a semantic completion option 5412 (labeled "full semantics" in FIG. 54A) via the user interface. The semantic completion option 5412 shown in FIG. 54A utilizes the process described above for generating a fill semantic graph. In particular, in response to selection of the semantic completion option 5412 in the user interface 5408, the scene-based image editing system 106 populates the expanded frame 5404 based on the semantic map 5406 (e.g., the initial semantic map of the digital image 5402).
As described above, the scene-based image editing system 106 receives a selection of the semantic completion option 5412. As shown, FIG. 54B illustrates that the scene-based image editing system 106 receives the selection of the semantic completion option 5412 to complete the semantic graph and generate a fill semantic graph 5414. Thus, comparing the semantic graph 5406 within the expanded frame 5404 shown in fig. 54A with the fill semantic graph 5414, fig. 54B shows the completion of the lower limbs (e.g., below the knees) corresponding to the expanded frame. For example, the fill semantic graph 5414 shows semantic labels for pants, ankles, and shoes/feet, while the digital image 5402 depicts the person only from the knees up.
Further, FIG. 54C illustrates the user interface 5408 for generating a modified digital image from the fill semantic graph in accordance with one or more embodiments. For example, fig. 54C shows a digital image completion option 5418 (labeled "completion RGB"). In particular, in response to selection of the digital image completion option 5418, the scene-based image editing system 106 generates a modified digital image 5416 based on the fill semantic map 5414. Thus, as shown in fig. 54C, the scene-based image editing system 106 provides (via the user interface 5408) a display of the modified digital image 5416 (e.g., the completed digital image) along with the fill semantic map 5414 and the digital image 5402. Similar to the discussion of fig. 54B, the modified digital image 5416 completes the person from the digital image 5402 from the knees down. In particular, as already mentioned, the scene-based image editing system 106 directs the generation of the modified digital image 5416 with the fill semantic graph 5414.
As described above, the scene-based image editing system 106 provides options for selecting regions in a digital image for removal. For example, fig. 55A shows the scene-based image editing system 106 with a digital image 5500 and a masked digital image 5502, where the masked digital image 5502 indicates the area to be removed. In particular, the digital image 5500 depicts a person holding a bag in each hand, and the scene-based image editing system 106 provides a tool to select the portion to remove. For example, the scene-based image editing system 106 receives an indication to remove the bags via a click selection or via a query specifying "package".
Fig. 55A further illustrates the scene-based image editing system 106 utilizing a user interface 5512 of a client device 5514 to provide a diffusion step iteration option 5506 for specifying the number of diffusion steps (e.g., sampling steps) used to perform the fill modification. In response to receiving the specified number of diffusion iterations, the scene-based image editing system 106 may dynamically select a number of diffusion/denoising layers for generating a modified digital image from the masked digital image 5502 and the digital image 5500. In particular, the number of diffusion iterations corresponds to the length of a fixed Markov chain (e.g., the number of diffusion layers and/or denoising layers). In other words, the diffusion iterations represent the number of steps of forward diffusion and the number of steps of backward diffusion (denoising). Further, in some implementations, the scene-based image editing system 106 takes only a few seconds to generate the modified digital image or a high resolution version of the digital image 5500.
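A minimal sketch of how a user-specified number of diffusion iterations might drive the denoising loop is shown below; `denoise_step` is a placeholder for a single pass of the denoising neural network, and the step count stands in for the value entered via the diffusion step iteration option.

```python
import torch
import torch.nn as nn

denoise_step = nn.Conv2d(3, 3, 3, padding=1)   # placeholder for one denoising pass

def sample(noise, num_diffusion_iterations):
    """Run the requested number of denoising (reverse diffusion) steps."""
    x = noise
    for _ in range(num_diffusion_iterations):
        x = denoise_step(x)
    return x

# this value would come from the diffusion step iteration option in the user interface
num_diffusion_iterations = 20
result = sample(torch.randn(1, 3, 64, 64), num_diffusion_iterations)
```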
In addition, fig. 55A also shows a high resolution option 5504 for specifying the number of diffusion steps used to generate a high resolution version of the modified digital image. For example, based on the adjustment from the fill semantic map, the diffusion neural network generates a low resolution version of the modified digital image. Specifically, in response to an indication to generate a high resolution version of the digital image, the diffusion neural network receives the generated (low resolution) modified digital image as input and performs the indicated number of diffusion iterations to generate the high resolution version of the modified digital image.
Although fig. 55A illustrates various user interface options (e.g., the diffusion step iteration option 5506 and the high resolution option 5504), in one or more embodiments, the scene-based image editing system 106 provides the digital image 5500 and the masked digital image 5502 (with the region to be removed) via the user interface 5512 of the client device 5514 without additional user interface elements. In particular, the scene-based image editing system 106 provides an intuitive and easy-to-use interface in which the client device 5514 displays only the digital image 5500 and the masked digital image 5502. In doing so, a user of the client device 5514 may click/tap/select a portion of the digital image 5500 via the user interface 5512, and the scene-based image editing system 106 intelligently detects and performs an action, such as generating a modified digital image. For example, in response to detecting a selection of an object in the person's hand, the scene-based image editing system 106 automatically generates a fill semantic map and fills in the region defined by the object. Similarly, in response to selection of a boundary or frame for expanding the digital image, the scene-based image editing system 106 automatically generates a fill semantic map and fills the expanded frame. Thus, in some embodiments, the scene-based image editing system 106 eliminates user interface elements.
Further, fig. 55B shows example results of the scene-based image editing system 106 removing items from the input digital image 5508. In particular, fig. 55B illustrates that an input digital image 5508 is received as input by the scene-based image editing system 106 and, in response to a user interaction, a modified digital image 5510 is generated that conforms to the user interaction. In particular, fig. 55B illustrates a precise and high quality modification of the input digital image 5508 to fill the hand and background portion behind the removed bag.
In one or more embodiments, the scene-based image editing system 106 intelligently utilizes the features discussed in FIGS. 49-55B to perform fill modifications. For example, if the user of the client device uploads a cropped digital image to an existing digital image being edited in the image editing application, the scene-based image editing system 106 intelligently determines the modifications. In particular, based on the uploaded cropped digital image not being aligned with the bottom frame of the existing digital image, the scene-based image editing system 106 determines a percentage by which to expand the digital frame of the uploaded cropped digital image. In addition, the scene-based image editing system 106 generates fill modifications to fill the expansion (e.g., completes the person/object of the cropped digital image so that the person/object aligns with the boundary of the new digital image, and then combines/composites the completed person with the new digital image). In addition, the scene-based image editing system 106 intelligently provides multiple outputs of digital images generated based on the uploaded cropped digital image. In particular, the multiple outputs include different variations generated by the scene-based image editing system 106 when filling the expanded digital image frames.
Furthermore, in one or more embodiments, scene-based image editing system 106 intelligently detects occlusions or impairments in the digital image. For example, the scene-based image editing system 106 detects and removes occlusions/impairments covering a portion of a person depicted within the digital image. In particular, the scene-based image editing system 106 removes the obscuration/impairments and generates a fill modification to replace the removed obscuration/impairments. For example, the scene-based image editing system 106 intelligently removes portions of the uploaded digital image and populates the removed portions.
Further, in one or more embodiments, scene-based image editing system 106 intelligently detects textures within the digital image and recommends alternative textures. For example, the scene-based image editing system 106 detects clothing depicted within the digital image and extracts textures of the clothing. In particular, the scene-based image editing system 106 then provides a recommendation of texture options to replace the current texture shown within the digital image.
Further, in one or more embodiments, the scene-based image editing system 106 intelligently detects other portions within the digital image and recommends options to replace the detected portions. For example, the scene-based image editing system 106 detects clothing, background, hair, and other accessories (e.g., purses, backpacks, hats, glasses, etc.). Specifically, the scene-based image editing system 106 detects portions of the digital image and provides a plurality of recommendations for each detected portion to replace the detected portion. Thus, for each of the embodiments described above, the scene-based image editing system 106 intelligently and automatically performs fill modifications or fill recommendations to the user of the client device in real-time or near real-time (e.g., a few seconds).
As briefly described above, in one or more implementations, the scene-based image editing system 106 generates a modified digital image by repairing an indicated portion of the digital image. Specifically, in one or more embodiments, the scene-based image editing system 106 performs partial human repair using a specially trained generative machine learning model tailored to repairing human subjects in digital images. For example, for a digital image depicting a person having an object covering (e.g., occluding or obscuring) a portion of the person, the scene-based image editing system 106 performs the repair of the person using a human repair GAN. In particular, the scene-based image editing system 106 generates the modified digital image by utilizing a structure guide map and the digital image depicting the person.
Further, in one or more embodiments, scene-based image editing system 106 performs human repair by utilizing two-dimensional modulation to generate a modified digital image. For example, the scene-based image editing system 106 generates visual appearance codes from digital images depicting a person using an encoder. In addition, the scene-based image editing system 106 also utilizes a parametric neural network to generate local appearance feature tensors, including spatially varying scaling and shifting tensors. In other words, the scene-based image editing system 106 modulates the structural code generated from the structural guide map at the human repair GAN. Two-dimensional modulation helps produce accurate and high quality digital images of human repair.
Further, in one or more embodiments, the scene-based image editing system 106 generates a digital image of the human repair by utilizing a hierarchical encoder. For example, scene-based image editing system 106 utilizes a layered encoder to generate structural encoding and visual appearance encoding. In particular, scene-based image editing system 106 utilizes a layered encoder that includes multiple downsampling and upsampling layers connected via a skip connection (e.g., a skip connection for matching resolution between layers). By utilizing a layered encoder, scene-based image editing system 106 can preserve nuances and details in generating a digital image of a human repair.
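The following is a minimal sketch, under assumed channel sizes, of a hierarchical encoder whose downsampling and upsampling layers are connected by skip connections at matching resolutions; it is illustrative only and not the patented architecture.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Downsampling/upsampling encoder with skip connections between matching resolutions."""
    def __init__(self, in_channels=3, base=32):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(in_channels, base, 3, stride=2, padding=1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU())
        self.up1 = nn.Sequential(nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.ReLU())
        self.up2 = nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1)

    def forward(self, x):
        d1 = self.down1(x)               # 1/2 resolution
        d2 = self.down2(d1)              # 1/4 resolution
        u1 = self.up1(d2)                # back to 1/2 resolution
        u1 = torch.cat([u1, d1], dim=1)  # skip connection at the matching resolution
        out = self.up2(u1)               # full-resolution feature map
        # multi-resolution outputs can serve as codes for different style-block resolutions
        return {"1/4": d2, "1/2": u1, "full": out}

codes = HierarchicalEncoder()(torch.rand(1, 3, 256, 256))
```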
Further, as described above, in one or more embodiments, the scene-based image editing system 106 utilizes a structure guide map to generate a structural code. For example, the structure guide map includes at least one of a key point map, a dense pose map, a segmentation map, or a digital image. Specifically, in some embodiments, the scene-based image editing system 106 utilizes separate neural network branches to generate a segmentation map of the digital image for further use as a structure-directing map. For example, the scene-based image editing system 106 generates a filled segmentation map of a digital image depicting a person having an unclassified region corresponding to a region to be repaired.
Furthermore, in other embodiments, the scene-based image editing system 106 utilizes multiple machine learning models for human repair and background repair. For example, the scene-based image editing system 106 uses a background GAN in addition to the human repair GAN. In particular, the scene-based image editing system 106 generates a modified background portion of the digital image with the background GAN and generates modified pixels for the region of the person depicted within the digital image with the human repair GAN.
In addition, the scene-based image editing system 106 may train the human repair GAN. For example, in some implementations, the scene-based image editing system 106 trains the human repair GAN with a combination of a partial reconstruction loss and an adversarial loss. Indeed, because occluded images often do not include ground truth information behind the occlusion, the scene-based image editing system 106 may utilize a partial reconstruction loss, where the loss is measured based on the portion of the person outside of the region to be repaired.
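A hedged sketch of combining a partial reconstruction loss (measured only on person pixels outside the repair region) with an adversarial loss follows; the discriminator, the L1 choice, and the loss weighting are placeholders and assumptions, not the actual training setup.

```python
import torch
import torch.nn as nn

def partial_reconstruction_loss(generated, original, repair_mask, person_mask):
    """L1 loss restricted to person pixels outside the repaired region,
    since no ground truth exists behind the occlusion."""
    valid = person_mask * (1.0 - repair_mask)
    return (valid * (generated - original).abs()).sum() / valid.sum().clamp(min=1.0)

discriminator = nn.Sequential(                  # placeholder discriminator
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 1, 4, stride=2, padding=1),
)

generated = torch.rand(1, 3, 128, 128, requires_grad=True)   # output of the human repair GAN (stand-in)
original = torch.rand(1, 3, 128, 128)                        # original (partially occluded) digital image
repair_mask = torch.zeros(1, 1, 128, 128)
repair_mask[:, :, 40:90, 40:90] = 1.0                        # region being repaired
person_mask = torch.ones(1, 1, 128, 128)                     # person segmentation (placeholder)

rec_loss = partial_reconstruction_loss(generated, original, repair_mask, person_mask)
adv_loss = -discriminator(generated).mean()                  # generator-side adversarial term (assumed form)
total_loss = rec_loss + 0.1 * adv_loss                       # weighting is an assumption
total_loss.backward()
```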
As described above, the scene-based image editing system 106 performs actions for generating modified digital images. For example, scene-based image editing system 106 utilizes multiple machine learning model instances to generate a modified digital image by performing partial human repair. In particular, FIG. 56 illustrates an overview of a scene-based image editing system 106 that performs partial human repair in accordance with one or more embodiments. For example, the scene-based image editing system 106 generates a realistic and accurate modified digital image 5604 from the input digital image 5600 by inpainting (inpainting) a portion of the input digital image 5600. In particular, the scene-based image editing system 106 analyzes indications of areas from a user of the client device for repair, such as segmented portions or indications of removal of objects that obstruct people depicted within the digital image. In addition, the indication from the user of the client device includes at least one of an indication to remove an object (e.g., occlude a person) from the digital image, to expand the digital image frame (e.g., the expanded portion needs to fill the person), or to remove a flaw from an area of the person.
As shown, FIG. 56 illustrates that the scene-based image editing system 106 receives an input digital image 5600 depicting a person (e.g., from an upload or selection at an image editing application). For example, the scene-based image editing system 106 determines a depiction of a person within the input digital image 5600. For example, the depiction of the person may include showing an input digital image 5600 of the person only from the chest up or from the chest down or any other sub-portion of the person. Further, the depiction of the person may include an input digital image 5600 having a frame that encompasses the entire body of the person, yet having an occlusion or hole that covers a portion of the person. For example, the input digital image 5600 depicts a person holding a towel that blocks a portion of the person.
As shown, for the input digital image 5600, the scene-based image editing system 106 determines a region to repair corresponding to the depiction of the person. In particular, the region of the person includes a portion of the input digital image 5600 that covers only a portion of the depicted person, or a portion of the input digital image 5600 that overlaps both a portion of the depicted person and another portion of the input digital image 5600. Specifically, where the input digital image 5600 depicts a person holding an object, the region to be inpainted for the person includes the object held by the person.
Also as shown, fig. 56 illustrates a scene-based image editing system 106 that utilizes a human repair GAN 5602 to generate a modified digital image 5604. As described above, the scene-based image editing system 106 utilizes various generation models to generate the realistic features. The scene-based image editing system 106 trains the human repair GAN 5602 exclusively for human repair. Further details regarding training human repair GAN 5602 are given below in the description of fig. 64. Further, details regarding specific architectural components of the human repair GAN 5602 are given below in the description of fig. 54.
Further, as further shown, fig. 56 shows that the scene-based image editing system 106 generates a modified digital image 5604 via the human repair GAN 5602. As previously discussed, the scene-based image editing system 106 determines a fill modification (e.g., act 4904 for determining the fill modification discussed in fig. 49), and the fill modification includes filling the region or, in other words, repairing the region. For example, the repair includes adding pixel values, replacing pixel values with new pixel values, or adding pixel values to an extended frame of the digital image, where the extended frame is unfilled. Thus, repairing includes adding pixel values to the digital image to fill in gaps, replacing objects, or adding to an extension of the digital image. Specifically, as shown in fig. 56, the scene-based image editing system 106 repairs the "hole" depicted in the input digital image 5600 by adding pixel values to generate a digital image depicting a realistic appearance of a portion of the input digital image 5600 subsequent to the "hole". Thus, human repair performed by scene-based image editing system 106 via human repair GAN 5602 includes actually filling the areas corresponding to the abdomen, hands, and pants of the person.
In one or more embodiments, the scene-based image editing system 106 utilizes a segmentation model to segment the input digital image 5600. In particular, the scene-based image editing system 106 utilizes a segmented image neural network (e.g., the segmented machine learning model 5304 discussed in fig. 53 or other segmented model discussed herein). For example, the scene-based image editing system 106 processes a selection or query of a user of the client device to select an object class within the input digital image 5600. Further, in response to the selection or query, the scene-based image editing system 106 generates segmentation masks. Thus, as shown, the input digital image 5600 includes relevant portions segmented by the scene-based image editing system 106.
Fig. 56 further illustrates a comparison of the modified digital image 5604 with respect to the original digital image 5606. As shown, the scene-based image editing system 106 actually depicts the portion of the person behind the towel. In particular, fig. 56 shows a scene-based image editing system 106 that performs realistic and accurate drawing of hands, abdomen, and swimwear worn by a person depicted within an original digital image 5606.
Conventional systems utilize recent computational advances to modify digital images using various digital tools and models. To illustrate, conventional systems utilize computer-implemented models to repair digital images. However, despite these advances, conventional systems still suffer from a number of technical drawbacks, particularly in terms of accuracy, efficiency, and flexibility in implementing the system in generating a repair digital image.
For example, FIG. 56 also shows the results of a conventional system producing a repaired person. Specifically, fig. 56 shows a prior art digital image 5608. For example, the prior art digital image 5608 also corresponds to the input digital image 5600. In particular, the prior art digital image 5608 depicts the same person as the input digital image 5600; however, the result of repairing the portion behind the segmented region shows that conventional systems cannot accurately and realistically generate a repaired person. Comparing the modified digital image 5604 produced by the scene-based image editing system 106 with the prior art digital image 5608 produced by the conventional system, the prior art digital image 5608 erroneously incorporates elements of the water background in place of the person's abdomen and swimsuit. In addition, the prior art digital image 5608 also retains traces of the towel that the person holds in the original digital image 5606. For example, the remaining contour of the towel is visible in the prior art digital image 5608, and the hand that once held the towel is not clearly depicted. Thus, unlike the prior art digital image 5608, the modified digital image 5604 accurately and realistically displays the repaired person without leaking the background into the person or retaining traces of the towel. Accordingly, conventional systems do not accurately repair a person, as shown in the prior art digital image 5608. Furthermore, as previously discussed, conventional systems are also inefficient in terms of the number of user interactions required to repair the digital image (e.g., to correct erroneously repaired pixels).
The scene-based image editing system 106 may improve accuracy over conventional systems. For example, in one or more implementations, the scene-based image editing system 106 improves the accuracy of conventional systems by utilizing local appearance feature tensors (e.g., spatially varying parameters) to generate smooth, natural, realistic, and accurate human fixes. Unlike conventional machine learning models that utilize only global features of a digital image (e.g., feature representations that do not change or vary across pixels of the digital image), the use of local appearance feature tensors can add more nuances, depths, and details to generate the fill. Thus, in some implementations, the scene-based image editing system 106 generates the restored digital image more accurately.
In addition, the scene-based image editing system 106 improves the accuracy of conventional systems by utilizing human repair GAN 5602. In particular, in one or more implementations, the scene-based image editing system 106 utilizes a human repair GAN 5602 that is specifically trained to perform human repair. Thus, unlike conventional systems that are typically trained to fill areas, the scene-based image editing system 106 can utilize specialized training and specialized generative models to repair people depicted within digital images.
In addition, the scene-based image editing system 106 may improve accuracy by utilizing structural guidance maps. In particular, the scene-based image editing system 106 utilizes a structural guidance map, such as a key point map, a gesture map, or a segmentation map, to inform the human repair GAN 5602 to more accurately generate a modified digital image.
In addition to improving accuracy, the scene-based image editing system 106 improves the efficiency of conventional systems. For example, the scene-based image editing system 106 reduces the need for excessive user interaction for generating modified digital images. In effect, by intelligently and accurately repairing regions of the digital image, the scene-based image editing system 106 reduces the user interaction and user interface required to remove occlusions, correct the digital image, and generate an accurate depiction of the human object.
As shown, FIG. 57 illustrates additional details regarding generating a modified digital image from various encodings by scene-based image editing system 106 in accordance with one or more embodiments. For example, FIG. 57 illustrates a scene-based image editing system 106 that utilizes a structural guide map 5700 to facilitate generation of a modified digital image 5716. In particular, the structural guidance map 5700 includes structural representations of people depicted within the digital image 5702. For example, the structure guide map 5700 contains data points relating to the structure position, shape, and/or pose of a person depicted within the digital image 5702. Thus, the structure guide map 5700 can include a key point map, a gesture map, or a segmentation map. In one or more embodiments, the scene-based image editing system 106 utilizes a machine learning model (e.g., a segmentation machine learning model, a keypoint machine learning model, a shape machine learning model, or a pose machine learning model) to generate the structure guide map.
As shown in fig. 57, the scene-based image editing system 106 extracts structural data from the structural guide map 5700 using an encoder 5706. As further shown, scene-based image editing system 106 generates structural code 5710 from structural guide map 5700 using encoder 5706. In particular, the structural code 5710 includes a low-dimensional representation of high-dimensional features from a structural guide map.
In addition to generating structural code 5710, fig. 57 also shows that scene-based image editing system 106 utilizes encoder 5706 to generate visual appearance code 5712 from digital image 5702. For example, visual appearance code 5712 includes visual information regarding digital image 5702. In particular, visual appearance code 5712 includes visual features of a person depicted as a digital representation within digital image 5702.
As shown, fig. 57 illustrates a scene-based image editing system 106 that utilizes an encoder 5706 to generate structural code 5710 and visual appearance code 5712. Thus, scene-based image editing system 106 may utilize a single encoder to generate structural code 5710 (according to structural guide diagram 5700) and visual appearance code 5712 (according to digital image 5702). In some implementations, using a single encoder may increase the efficiency of the scene-based image editing system 106. In particular, the use of a single encoder (e.g., encoder 5706) reduces the amount of computational resources required for training and implementation.
In some embodiments, scene-based image editing system 106 utilizes multiple encoders. For example, fig. 57 also shows a scene-based image editing system 106 that optionally utilizes encoder 5708. Thus, in some embodiments, scene-based image editing system 106 generates visual appearance code 5712 from digital image 5702 using encoder 5708, while generating structural code 5710 from structural guide map 5700 using encoder 5706.
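A minimal sketch of the single-encoder variant follows: one encoder maps the structure guide map to a structural code and the digital image to a visual appearance code; a two-encoder variant would simply instantiate the module twice. The channel sizes and layer choices are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Shared encoder used for both the structure guide map and the digital image."""
    def __init__(self, in_channels=3, code_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, code_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.features(x)  # low-dimensional (spatial) code of the high-dimensional input

encoder = Encoder()
structure_guide_map = torch.rand(1, 3, 256, 256)   # e.g., rendered keypoint/pose/segmentation map
digital_image = torch.rand(1, 3, 256, 256)         # digital image depicting the person

structural_code = encoder(structure_guide_map)
visual_appearance_code = encoder(digital_image)
```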
As further shown, fig. 57 illustrates a scene-based image editing system 106 that utilizes human repair GAN 5714 to generate a modified digital image 5716 from both structural encoding 5710 and visual appearance encoding 5712. As previously described, the scene-based image editing system 106 utilizes the human repair GAN to generate a realistic looking repaired person within the digital image 5702. Specifically, the scene-based image editing system 106 tunes the human repair GAN 5714 specifically for humans within the digital image to generate realistic repair humans.
As described above, in some implementations, the scene-based image editing system 106 generates local appearance feature tensors to improve the accuracy of the human repair. For example, fig. 58 illustrates additional details regarding adjusting human repair GAN 5814 with local appearance feature tensors in accordance with one or more embodiments.
In particular, fig. 58 shows the scene-based image editing system 106 utilizing a structure guide map 5802. As described above, the structure guide map 5802 may include a key point map, a pose map, a segmentation map, or a digital image. The key point map includes representations of points that make up the human figure. For example, the key point map defines the structure of a person with various points (e.g., points corresponding to the head, joints, limbs, etc.). Thus, the key point map includes the key structural form of the person depicted in the digital image 5804.
In some embodiments, scene-based image editing system 106 utilizes a gesture map. In particular, the gesture graph includes dense gesture representations. For example, in generating the pose map, scene-based image editing system 106 maps known pixels of digital image 5804 to a three-dimensional surface of a human being. The pose map includes a dense sequence of keypoint representations of the digital image 5804. In some embodiments, scene-based image editing system 106 utilizes a pose estimation algorithm, such as a dense pose, to generate a dense keypoint sequence representation. For example, scene-based image editing system 106 may utilize a pose estimation algorithm that maps pixels within a digital image depicting a person to a 3D surface of the person (e.g., by quantifying UV values).
Furthermore, in one or more embodiments, scene-based image editing system 106 utilizes a segmentation map (such as a semantic map as described above). For example, the segmentation map used as the structure guide map 5802 includes depictions of people within the digital image 5804 segmented according to various classifications. For example, the segmentation map includes a segmentation between a background and a person. In particular, the segmentation map also segments sub-parts of a person, such as jackets, pants, shoes, heads and hair worn by the person.
Furthermore, in some embodiments, scene-based image editing system 106 utilizes digital image 5804 (or a portion of digital image 5804) as a structural guide map 5802. For example, scene-based image editing system 106 utilizes digital image 5804, digital image 5804 depicts a person as structure guide map 5802. In particular, digital image 5804 contains structural information such as boundaries of persons and various sub-portions of persons reflected by different hues within digital image 5804. In this way, scene-based image editing system 106 utilizes the person from digital image 5804 as a structural guide map 5802.
As further shown, scene-based image editing system 106 generates structural code 5808 and visual appearance code 5810 from structural guide map 5802 and digital image 5804 using encoder 5806. In particular, scene-based image editing system 106 generates structural code 5808 and visual appearance code 5810 for each resolution of digital image 5804. Further details regarding generating structural code 5808 and visual appearance code 5810 are given below in the description of fig. 56-57.
In addition, fig. 58 shows the scene-based image editing system 106 utilizing a parametric neural network 5812. For example, the scene-based image editing system 106 generates local appearance feature tensors 5816 from the visual appearance code 5810 using the parametric neural network 5812. In particular, the parametric neural network 5812 generates locally applied scaling and shifting parameters (e.g., parameters that vary across regions or pixel locations). Specifically, the parametric neural network 5812 generates the local appearance feature tensors 5816 (scaling and shifting parameters) for modulating the style blocks/layers of the human repair GAN 5814. In other words, the local appearance feature tensors 5816 for the visual appearance code 5810 include a scaling tensor and a shifting tensor that modify the digital image 5804 locally, such that the scaling tensor and the shifting tensor vary across different portions or regions. Thus, the scene-based image editing system 106 uses the parametric neural network 5812 to generate feature tensors (scaling and shifting tensors) from the visual appearance code 5810 at a particular resolution corresponding to the resolution of a particular style block. In addition, the scene-based image editing system 106 uses the parametric neural network 5812 to generate different feature tensors (different locally varying scaling and shifting tensors) for the different style blocks of the human repair GAN 5814 at different resolutions.
For example, a feature tensor comprises a multi-dimensional array of values representing a feature or characteristic of underlying data (such as a digital image, structural representation, and/or visual appearance representation). The scene-based image editing system 106 utilizes the feature tensor to modulate the pattern blocks within the human repair GAN.
In one or more embodiments, the human repair GAN utilizes a StyleGAN or StyleGAN2 architecture (as discussed and incorporated above with respect to StyleGAN2). For example, scene-based image editing system 106 utilizes a convolution layer (e.g., a plurality of fully connected convolution layers) to convert a digital image into intermediate latent vectors (e.g., w vectors in W or W+ space). Next, the scene-based image editing system 106 converts the intermediate latent vectors into latent style vectors using a series of learned transformations.
Scene-based image editing system 106 utilizes a series of layers called style blocks to convert the style vectors into feature representations. In particular, scene-based image editing system 106 processes a first style vector with a first style block. The first style block generates a first intermediate feature vector and passes the first intermediate feature vector to a second style block. The second style block processes the first intermediate feature vector and a second style vector to generate a second intermediate feature vector. Thus, scene-based image editing system 106 iteratively utilizes style blocks to generate a series of intermediate feature vectors. Scene-based image editing system 106 may also modulate these style blocks by combining additional representations with the style vectors and/or intermediate feature vectors.
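To illustrate this iterative use of style blocks, the following sketch passes a sequence of style vectors through toy style blocks, each producing an intermediate feature map that feeds the next block. The block internals are deliberately simplified placeholders rather than the StyleGAN2 implementation.

```python
import torch
import torch.nn as nn

class ToyStyleBlock(nn.Module):
    """Simplified style block: combines the incoming features with a style
    vector and returns an intermediate feature map for the next block."""

    def __init__(self, channels, style_dim):
        super().__init__()
        self.affine = nn.Linear(style_dim, channels)  # style -> per-channel scale
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, features, style_vector):
        scale = self.affine(style_vector).unsqueeze(-1).unsqueeze(-1)
        return self.conv(features * scale)

blocks = nn.ModuleList([ToyStyleBlock(64, 512) for _ in range(3)])
features = torch.randn(1, 64, 16, 32)           # initial feature representation
styles = [torch.randn(1, 512) for _ in blocks]  # one style vector per block

# Iteratively generate the series of intermediate feature representations.
for block, style in zip(blocks, styles):
    features = block(features, style)
```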
As described above, the scene-based image editing system 106 generates the local appearance feature tensor 5816. For example, the local appearance feature tensor 5816 includes a scaling tensor and/or a shifting tensor. In other words, the scene-based image editing system 106 modulates the feature representation of a particular style block with different scaling and shifting tensors according to pixel locations within the digital image 5804. Thus, scene-based image editing system 106 modulates (scales and shifts) the feature representation of human repair GAN 5814 with local appearance feature tensors 5816 from visual appearance code 5810 corresponding to a particular resolution of a particular style block. Further, as shown, scene-based image editing system 106 generates modified digital image 5818 via human repair GAN 5814.
For example, in one or more embodiments, the scene-based image editing system utilizes a scaling tensor and a shifting tensor for two-dimensional modulation, as described by B. AlBahar, J. Lu, J. Yang, Z. Shu, E. Shechtman, and J. Huang in Pose with Style: Detail-Preserving Pose-Guided Image Synthesis with Conditional StyleGAN, SIGGRAPH Asia 2021 (hereinafter Pose with Style), which is incorporated herein by reference in its entirety. Unlike Pose with Style, in one or more embodiments, scene-based image editing system 106 performs modulation for different features and inputs, for a different architecture, and for generating different results.
As described above, fig. 58 illustrates scene-based image editing system 106 generating a single visual appearance code 5810 and further utilizing parametric neural network 5812 to generate local appearance feature tensors 5816 (which include spatially varying shift tensors and/or spatially varying scaling tensors). Although not shown in fig. 58, in some implementations, scene-based image editing system 106 generates visual appearance codes and structural codes for each resolution of style blocks within human repair GAN 5814. For each visual appearance code, the scene-based image editing system 106 generates local appearance feature tensors from the visual appearance code corresponding to the resolution of the corresponding channel or style block using the parametric neural network 5812. Thus, each generated local appearance feature tensor modulates the structural coding at a different style block of human repair GAN 5814. More details regarding the regulation of human repair GAN 5814 are given below in the description of fig. 58.
Although the above description illustrates the scene-based image editing system 106 utilizing a single structural guide map, in one or more embodiments, the scene-based image editing system 106 utilizes multiple structural guide maps. Specifically, scene-based image editing system 106 generates structural code 5808 from a combination of key point maps and pose maps. In other cases, the scene-based image editing system 106 generates structural codes from a combination of the segmentation map and the digital image. Further, the scene-based image editing system 106 may utilize various combinations of available structure guide maps to generate the structural code 5808.
As described above, FIG. 59 provides further details regarding generating structural codes in accordance with one or more embodiments. For example, FIG. 59 shows a scene-based image editing system 106 that utilizes a layered encoder to generate structural code 5910. In particular, the layered encoder includes a plurality of layers, and each layer of the layered encoder corresponds to a different resolution of the style blocks of the human repair GAN. In particular, the scene-based image editing system 106 utilizes a layered encoder having a U-shaped architecture and generates a structural code corresponding to each resolution of the structural guide map 5902.
As shown in fig. 59, the scene-based image editing system 106 generates the structural code 5910 using a layered encoder for the structural guide map 5902. The layered encoder includes a downsampling layer 5904 and an upsampling layer 5906. For downsampling, the layered encoder moves from high resolution to low resolution. In particular, downsampling involves moving from a higher resolution (e.g., 256 x 512) to lower resolutions corresponding to the style blocks of the human repair GAN. Furthermore, during downsampling, the scene-based image editing system 106 also utilizes skip connections to the corresponding upsampling layers.
For example, upsampling includes moving from a lower resolution to a higher resolution corresponding to the style blocks of the human repair GAN. In particular, the scene-based image editing system 106 utilizes the skip connections described above to receive an encoding from the corresponding downsampling layer before moving to the next upsampling layer. In response to receiving the encoding from downsampling layer 5904 via the skip connection, scene-based image editing system 106 combines the encodings. In particular, scene-based image editing system 106 concatenates the encodings from the respective downsampling and upsampling layers. The scene-based image editing system 106 then passes the combined encoding to the next upsampling layer and repeats the process.
As shown in fig. 59, the scene-based image editing system 106 generates a structural code 5910 from the structural guide map 5902 by using a layered encoder. In particular, fig. 59 shows a downsampling layer 5904 having downsampling blocks for each resolution utilized by scene-based image editing system 106. Specifically, downsampling layer 5904 includes downsampling blocks ranging from 128×256 to 2×4. Further, fig. 59 shows upsampling blocks with increasing resolution. Specifically, the upsampling blocks range from 4×8 to 128×256. In addition, as shown, fig. 59 has skip connections 5905a-5905c between respective downsampling and upsampling layers. For example, the skip connections 5905a-5905c skip layers within the neural network and feed the output from one layer as input to another layer. Specifically, the skip connections 5905a-5905c help prevent degradation of features within the structural guide map 5902. Further, fig. 59 shows that scene-based image editing system 106 generates structural code 5910 from convolutional layer 5908. For example, via the skip connections 5905a-5905c between downsampling layer 5904 and upsampling layer 5906, scene-based image editing system 106 passes the output from the layered encoder to convolutional layer 5908 to generate structural code 5910.
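The following is a compact sketch of a U-shaped hierarchical encoder with skip connections of the kind described above; the channel counts, depths, and the final convolutional head are illustrative assumptions rather than the system's actual layer configuration.

```python
import torch
import torch.nn as nn

class TinyUEncoder(nn.Module):
    """U-shaped encoder: a downsampling path, an upsampling path, and skip
    connections that concatenate matching-resolution encodings."""

    def __init__(self, channels=64):
        super().__init__()
        self.down1 = nn.Conv2d(3, channels, 3, stride=2, padding=1)         # e.g. 128x256
        self.down2 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # e.g. 64x128
        self.up1 = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
        # After concatenating the skip connection, the channel count doubles.
        self.up2 = nn.ConvTranspose2d(channels * 2, channels, 4, stride=2, padding=1)
        self.to_code = nn.Conv2d(channels, 512, 3, padding=1)  # structural code head

    def forward(self, guide_map):
        d1 = self.down1(guide_map)       # higher-resolution encoding
        d2 = self.down2(d1)              # lower-resolution encoding
        u1 = self.up1(d2)                # upsample back toward d1's resolution
        u1 = torch.cat([u1, d1], dim=1)  # skip connection: concatenate encodings
        u2 = self.up2(u1)
        return self.to_code(u2)          # structural code for one resolution

code = TinyUEncoder()(torch.randn(1, 3, 256, 512))
```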
As previously described, the layered encoder generates encodings for the various resolutions (i.e., each resolution of the channels/style blocks of the human repair GAN). As shown, fig. 59 illustrates the scene-based image editing system 106 utilizing a convolutional layer 5908 to generate a structural code 5910 corresponding to a 16×32 resolution style block. Although not shown, the scene-based image editing system 106 generates a structural code via the layered encoder for each resolution of the different style blocks in order to analyze the structural code at different resolutions.
FIG. 60 also shows a layered encoder; however, in accordance with one or more embodiments, scene-based image editing system 106 utilizes the layered encoder in fig. 60 to generate local appearance feature tensors. For example, fig. 60 shows the scene-based image editing system 106 utilizing a layered encoder with various downsampling layers 6004 and upsampling layers 6006 to generate local appearance feature tensors. Downsampling layers 6004 and upsampling layers 6006 operate as discussed above in connection with the description of fig. 59. As previously described, scene-based image editing system 106 combines the encodings from downsampling layer 6004 and upsampling layer 6006 through skip connections. However, fig. 60 further illustrates that scene-based image editing system 106 passes the combination to parametric neural network 6008 (e.g., parametric neural network 5812 discussed in fig. 58). In the parametric neural network 6008, the scene-based image editing system 106 utilizes additional convolution layers to generate the various scaling and shifting tensors.
As an example, fig. 60 shows that scene-based image editing system 106 passes encoding from each resolution (e.g., 128 x 256 and 16 x 32) to parametric neural network 6008. In particular, fig. 60 shows that scene-based image editing system 106 passes encoding from 128×256 resolution to convolutional layer 6010a and convolutional layer 6010b. In addition, fig. 60 also shows that scene-based image editing system 106 passes encoding from 16 x 32 resolution to convolutional layer 6010c and convolutional layer 6010d. Although fig. 60 only shows the scene-based image editing system 106 generating scaling and shifting tensors for two different resolutions (e.g., 128 x 256 and 16 x 32), in one or more embodiments the scene-based image editing system 106 generates scaling and shifting tensors for each resolution of the digital image 6002.
As shown in fig. 60, scene-based image editing system 106 generates shifting and scaling tensors (e.g., α and β) for each resolution. The local appearance feature tensors discussed above include these scaling and shifting tensors. For example, the local appearance feature tensors can also be described as a spatially varying scaling tensor and a spatially varying shifting tensor. In particular, the spatially varying aspect means that the scaling and shifting tensors vary according to spatial position. Thus, different positions of the digital image 6002 utilize different scaling and shifting values. For example, scene-based image editing system 106 may modulate a first region of a first intermediate feature vector of a first style block (corresponding to a first resolution) with a first scaling value and a first shift value. The scene-based image editing system 106 may modulate a second region of the first intermediate feature vector of the first style block with a second scaling value and a second shift value.
Similarly, scene-based image editing system 106 may modulate a third region of a second intermediate feature vector of a second style block (corresponding to a second resolution) with a third scaling value and a third shift value. Further, the scene-based image editing system 106 may modulate a fourth region of the second intermediate feature vector of the second style block with a fourth scaling value and a fourth shift value.
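As a simple numerical illustration of such spatially varying modulation, the sketch below applies different scaling and shifting values to different regions of an intermediate feature map; the region boundaries and the particular values are arbitrary assumptions chosen only to show the mechanics.

```python
import torch

# Intermediate feature map for one style block (batch, channels, height, width).
features = torch.randn(1, 64, 16, 32)

# Spatially varying scale (alpha) and shift (beta): one value per pixel location.
alpha = torch.ones(1, 1, 16, 32)
beta = torch.zeros(1, 1, 16, 32)
alpha[..., :, :16] = 1.5   # first region: stronger scaling
beta[..., :, :16] = 0.2    # first region: positive shift
alpha[..., :, 16:] = 0.8   # second region: weaker scaling
beta[..., :, 16:] = -0.1   # second region: negative shift

# Modulation: elementwise scale and shift, so different locations are
# modulated differently, as described for the local appearance feature tensors.
modulated = alpha * features + beta
```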
As described above, FIG. 61 provides more detail regarding modulation in accordance with one or more embodiments. For example, fig. 61 shows a scene-based image editing system 106 that utilizes local appearance feature tensors 6104 to modulate the vector at each style block of the human repair GAN. In particular, fig. 61 shows that scene-based image editing system 106 generates local appearance feature tensors 6104 via parametric neural network 6102. As described above, the local appearance feature tensor 6104 includes scaling and shifting tensors that spatially vary at different resolutions, as indicated by the first α, first β, second α, second β, and so on. Further, fig. 61 shows a scene-based image editing system 106 that utilizes a local appearance feature tensor 6104 to modulate each of the pattern blocks of the human repair GAN.
For example, as shown, the scene-based image editing system 106 utilizes a human repair GAN having a plurality of style blocks. In particular, fig. 61 shows that scene-based image editing system 106 generates intermediate vector 6110 using first style block 6108. In other words, the human repair GAN includes each style block shown in fig. 61. For example, fig. 61 shows the scene-based image editing system 106 utilizing a first style block 6108, the first style block 6108 including a set of layers of the human repair GAN corresponding to a certain resolution of the digital image. To illustrate, the first style block 6108 corresponds to a first resolution of the digital image, and the second style block 6112 increases to a second resolution. As shown, the first style block 6108 of the human repair GAN utilizes corresponding scaling and shifting tensors (the first α and β). The scene-based image editing system 106 modulates the vector (e.g., style vector and/or intermediate vector representation) at each style block of the human repair GAN with the corresponding scaling and shifting tensors from the local appearance feature tensors 6104. Specifically, fig. 61 shows that the first style block 6108 includes a modulation layer, a convolution layer, and a demodulation layer. For example, scene-based image editing system 106 generates intermediate vector 6110 via first style block 6108 using the modulation layer, convolution layer, and demodulation layer.
Modulation in a neural network refers to the process of adjusting the intensity or influence of certain inputs or paths in the network. This may be accomplished by adjusting the weight or bias of a particular neuron or group of neurons. Modulation may be used to improve the performance of the network, for example by increasing the sensitivity of certain neurons to important features in the input data, or by reducing the effects of noise or irrelevant information. It may also be used to control the behavior of the network, for example by adjusting the strength of certain paths to control the output of the network. For example, in fig. 61, the modulation includes scaling and shifting the vectors of the human repair GAN. In particular, the structural feature representation is scaled and shifted to affect the output of the human repair GAN. By utilizing the local appearance feature tensors 6104, the modulation includes a spatially varying scaling tensor and a spatially varying shifting tensor. For example, the scene-based image editing system 106 generates a repaired human within the modified digital image by modulating the layers of the human repair GAN based on the local appearance feature tensors 6104 (e.g., information that is locally applicable to particular locations within the digital image).
As described above, for the human repair GAN, each style block contains multiple layers (e.g., a modulation layer, a convolution layer, and a demodulation layer). The scene-based image editing system 106 modulates the first vector at the modulation layer, passes the modulated structural code to the convolution layer, and then to the demodulation layer (or normalization layer) to generate an intermediate vector 6110. In fact, in one or more implementations, the scene-based image editing system 106 modulates the mean and standard deviation of the features (e.g., style vectors) using the scaling and shifting tensors (at the modulation layer), applies a convolution layer, and normalizes (at the demodulation layer) the output of the convolution to zero mean and unit standard deviation prior to adding the bias and the StyleGAN2 noise broadcast operation. In some embodiments, scene-based image editing system 106 utilizes the modulation approach described in Pose with Style.
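A minimal sketch of this modulate-convolve-demodulate sequence is shown below, assuming elementwise feature modulation followed by normalization of the convolution output to zero mean and unit standard deviation. This simplifies the published weight-modulation formulation and is not the system's exact implementation.

```python
import torch
import torch.nn as nn

def style_block_step(features, alpha, beta, conv):
    """One style-block pass: modulate, convolve, then demodulate (normalize)."""
    # Modulation layer: spatially varying scale and shift of the features.
    modulated = alpha * features + beta
    # Convolution layer.
    out = conv(modulated)
    # Demodulation layer: normalize per channel to zero mean and unit standard
    # deviation before any bias or noise would be added.
    mean = out.mean(dim=(2, 3), keepdim=True)
    std = out.std(dim=(2, 3), keepdim=True) + 1e-8
    return (out - mean) / std

conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
features = torch.randn(1, 64, 16, 32)
alpha = torch.rand(1, 64, 16, 32)   # spatially varying scaling tensor
beta = torch.rand(1, 64, 16, 32)    # spatially varying shifting tensor
intermediate = style_block_step(features, alpha, beta, conv)
```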
As further shown, the scene-based image editing system 106 utilizes the human repair GAN at the second style block 6112 to generate additional intermediate vectors 6114 from the intermediate vectors 6110 (and/or style vectors analyzed/processed by each style block but not shown in fig. 61) and the local appearance feature tensor 6104. The process of passing the intermediate vector 6110 through the modulation layer, convolution layer, and demodulation layer continues, and in particular, the scene-based image editing system 106 modulates the second vector with the second α and β (e.g., additional spatially varying scaling tensors and additional spatially varying shifting tensors at the second resolution). As shown, the second style block 6112 generates an additional intermediate vector 6114, while the third style block 6116 generates an additional intermediate vector 6118 from the additional intermediate vector 6114.
As previously described, the scene-based image editing system 106 generates a segmented digital image (e.g., segmented digital image 5306 discussed in the description of fig. 53). As shown, FIG. 62 illustrates an overview of a scene-based image editing system 106 that utilizes separate segmentation branches to generate a filled segmentation map and a modified digital image in accordance with one or more embodiments. For example, FIG. 62 illustrates scene-based image editing system 106 performing act 6204 based on digital image 6202. Specifically, act 6204 includes determining a region to repair within digital image 6202. As previously described, determining the region to repair includes a selection by a user of the client device or a query to a segmentation machine learning model.
Further, as shown, fig. 62 illustrates the scene-based image editing system 106 utilizing a generative segmentation machine learning model 6206. However, prior to utilizing the generative segmentation machine learning model 6206, the scene-based image editing system 106 utilizes a segmentation machine learning model to generate an initial segmentation map (not shown in fig. 62). Specifically, prior to generating the filled segmentation map 6208, the scene-based image editing system 106 generates an initial segmentation map via the segmentation machine learning model previously described above. Specifically, the scene-based image editing system 106 generates an initial segmentation map from the digital image 6202, and then generates the filled segmentation map 6208 based on the initial segmentation map, the digital image 6202, and the determined region to repair.
As described above, the scene-based image editing system 106 generates an initial segmentation map. However, due to the hole or occlusion of the person depicted within the digital image 6202, the segmentation map includes unclassified regions. In particular, the unclassified region corresponds to the human repair region. The scene-based image editing system 106 generates a filled semantic map using a generative semantic machine learning model (e.g., the generative semantic machine learning model 4906 discussed in the description of fig. 49). Here, for example, the scene-based image editing system 106 utilizes the generative segmentation machine learning model 6206 to generate the filled segmentation map 6208.
Thus, the scene-based image editing system 106 generates a realistic and accurate filled segmentation of the determined region to repair using a generative model specific to segmentation maps. For the filled segmentation map 6208, the scene-based image editing system 106 assigns labels to individual pixels within the digital image 6202. Specifically, scene-based image editing system 106 assigns a label to each pixel within digital image 6202. For example, scene-based image editing system 106 assigns labels to pixels in a manner that groups pixels that share certain features (e.g., background portions or foreground portions) together. Further, the filled segmentation map 6208 assigns labels to the region determined to be repaired. Thus, various models utilize the filled segmentation map 6208 to determine how to fill the region indicated for repair.
For example, as described above, the scene-based image editing system 106 generates a filled segmentation map, and the scene-based image editing system 106 assigns a human segmentation classification to the region determined for repair. Specifically, the repair of the region corresponds to the person depicted within the digital image 6202, and thus the region for repair includes a segmentation classification of the person. In particular, the human segmentation classification corresponds to sub-portions of a person, such as a hand, foot, arm, leg, torso, or head. As further shown, the scene-based image editing system 106 utilizes an encoder 6212, the encoder 6212 generating an output based on the filled segmentation map 6208 and the digital image 6202. In particular, the scene-based image editing system 106 utilizes the output of the encoder 6212 and, via the human repair GAN 6214, generates a modified digital image 6216.
As previously described, in one or more embodiments, the scene-based image editing system 106 utilizes a human repair GAN and a background GAN. Fig. 63 shows details regarding the scene-based image editing system 106, which scene-based image editing system 106 generates a modified digital image 6314 by utilizing both the background GAN 6310 and the human repair GAN 6312. For example, fig. 63 illustrates a scene-based image editing system 106 that utilizes an input digital image 6302 to segment or identify portions within the input digital image 6302. For example, scene-based image editing system 106 determines background portion 6304. The background portion 6304 includes non-foreground elements such as pixel values corresponding to landscapes, buildings, mountains, and sky. For example, for the input digital image 6302, the background portion 6304 includes a portion of the input digital image 6302 that does not overlap with the depicted person. In addition, the scene-based image editing system 106 utilizes the background GAN 6310 to generate modified background portions of the input digital image 6302. In one or more implementations, the scene-based image editing system 106 utilizes a foreground/background segmentation machine learning model to separate/segment a foreground (human) portion of a digital image from a background portion of the digital image.
The scene-based image editing system 106 utilizes the modified background portion of the input digital image 6302 to generate a modified digital image 6314. The background GAN 6310 includes a GAN that generates a modified background portion of a digital image. Specifically, the background GAN 6310 replaces or adds pixel values to the background portion 6304 to generate a realistic-looking background portion 6304. For example, in one or more embodiments, the scene-based image editing system 106 utilizes a cascade modulation repair neural network (or another architecture such as CoModGAN) as the background GAN 6310.
In addition to determining the background portion 6304, the scene-based image editing system 106 also determines the human portion 6306 of the input digital image 6302. For example, scene-based image editing system 106 determines the human portion 6306 and generates modified pixels corresponding to the area of the person. In particular, the scene-based image editing system 106 utilizes the human repair GAN 6312 to generate modified pixels of the human portion 6306 depicted in the input digital image 6302. As shown, the scene-based image editing system 106 utilizes a combination of the modified background portion and the modified human portion to generate a modified digital image 6314. Thus, as just discussed, in some embodiments, scene-based image editing system 106 utilizes separate generative models to modify background portion 6304 and human portion 6306.
The scene-based image editing system 106 may utilize various methods to combine the background and human portions of the digital image produced by the human repair GAN 6312 and the background GAN 6310. In some implementations, the scene-based image editing system 106 masks the indicated objects for removal, repairs background pixels with the background GAN 6310, and repairs human pixels (for holes or filled portions) with the human repair GAN 6312. In particular, scene-based image editing system 106 repairs remaining background pixels independently of pixels corresponding to the human repair area.
In some cases, this approach may result in bleeding of the human region into the background pixels. In one or more embodiments, scene-based image editing system 106 utilizes a different method to address this potential bleeding. For example, in one or more implementations, the scene-based image editing system 106 utilizes the human repair GAN 6312 to complete the pixels of the person, but segments and removes the person from the digital image. The scene-based image editing system 106 uses the background GAN 6310 to repair the background pixels (with no person in the digital image) and then reinserts the person into the digital image.
To illustrate, scene-based image editing system 106 generates modified digital image 6314 by generating an intermediate digital image. For example, the scene-based image editing system 106 generates an intermediate digital image by removing areas of people in the input digital image 6302. Specifically, the scene-based image editing system 106 masks the area corresponding to the person and removes the masked area. Further, for intermediate digital images, scene-based image editing system 106 repairs the remaining portion including background portion 6304 with background GAN 6310. In so doing, the intermediate digital image contains a background portion 6304 in which the human portion 6306 is temporarily removed.
Further, the scene-based image editing system 106 generates modified pixels corresponding to the region of the human portion 6306 (e.g., the human segmentation classification) by utilizing the human repair GAN 6312. For example, scene-based image editing system 106 separately generates modified pixels for the human portion 6306 and inserts the modified pixels corresponding to the human portion 6306 into the intermediate digital image. In this way, scene-based image editing system 106 generates modified digital image 6314.
In some cases, this approach may also lead to artifacts along the boundary of the person and background pixels, especially where segmentation is inaccurate. As shown in fig. 63, in one or more embodiments, scene-based image editing system 106 determines an overlapping portion 6308 of input digital image 6302. In particular, the overlapping portion 6308 includes an overlap between the human portion 6306 and the background portion 6304. For example, in one or more embodiments, scene-based image editing system 106 determines the overlapping portion 6308 by determining a first set of pixels corresponding to human pixels from a repaired portion of the digital image and a second set of pixels corresponding to background pixels from the repaired portion of the digital image. The scene-based image editing system 106 may expand the first set of pixels (and/or the second set of pixels) to determine the overlapping portion.
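One way to realize this overlap computation, under the assumption that the human and background repair regions are available as binary masks, is to dilate the human mask and intersect it with the background mask, as sketched below; the dilation radius is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def compute_overlap(human_mask, background_mask, dilation_px=8):
    """Expand the human repair pixels and intersect with background repair
    pixels to obtain the overlapping portion."""
    structure = np.ones((2 * dilation_px + 1, 2 * dilation_px + 1), dtype=bool)
    expanded_human = binary_dilation(human_mask, structure=structure)
    return expanded_human & background_mask

human_mask = np.zeros((256, 512), dtype=bool)
background_mask = np.ones((256, 512), dtype=bool)
human_mask[100:200, 200:300] = True
background_mask[100:200, 200:300] = False
overlap = compute_overlap(human_mask, background_mask)
```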
In one or more embodiments, the scene-based image editing system 106 utilizes the human repair GAN 6312 to generate modified pixel values corresponding to the human portion 6306 and the overlapping portion 6308. In addition, the scene-based image editing system 106 utilizes the background GAN 6310 on the background portion 6304. Specifically, scene-based image editing system 106 performs background repair on background portion 6304 conditioned on the human repair pixels (e.g., the scene-based image editing system 106 utilizes human repair GAN 6312 to generate modified pixel values corresponding to human portion 6306 and overlapping portion 6308). Thus, the combination of the repaired background portion, the repaired human portion, and the repaired overlapping portion produces the modified digital image 6314. In other words, the scene-based image editing system 106 first repairs pixels corresponding to the human portion 6306, and then the scene-based image editing system 106 repairs the background portion 6304 conditioned on the repair performed on the human portion 6306.
As described above, the scene-based image editing system 106 determines portions of the input digital image 6302. In one or more embodiments, the scene-based image editing system 106 masks the various determined portions of the input digital image 6302. For example, scene-based image editing system 106 generates a mask for the region of the human portion 6306 and a mask for the background portion 6304. In particular, each mask segments a portion of the input digital image 6302. For example, the scene-based image editing system 106 generates a mask by utilizing a segmentation machine learning model to identify pixels corresponding to the indicated region (e.g., the human portion 6306, the background portion 6304, and/or the overlapping portion 6308).
As just described, the scene-based image editing system 106 uses masks of the various portions of the input digital image 6302. Specifically, the scene-based image editing system 106 first determines the human portion 6306 and the overlapping portion 6308 (where the human and background overlap), generates masks, and repairs the masked human portion and the masked overlapping portion with the human repair GAN 6312. Subsequently, the scene-based image editing system 106 repairs the masked background portion using the background GAN 6310. Further, scene-based image editing system 106 combines the repaired masked background portion with the repaired masked human portion and the repaired masked overlapping portion to produce the modified digital image 6314, as sketched below.
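The workflow just described can be summarized in pseudocode-style Python; human_inpaint, background_inpaint, and the placeholder masks stand in for the models and segmentation outputs discussed above and are not actual APIs of the system.

```python
import numpy as np

def composite_repair(image, human_mask, overlap_mask, background_mask,
                     human_inpaint, background_inpaint):
    """Repair the human and overlap regions first, then repair the background
    conditioned on the human result, and combine into the modified image."""
    # 1. Human repair model fills the masked human portion and overlap portion.
    human_region = human_mask | overlap_mask
    human_result = human_inpaint(image, human_region)

    # 2. Background model fills the masked background portion, conditioned on
    #    the already-repaired human pixels.
    background_result = background_inpaint(human_result, background_mask)

    # 3. Combine: background pixels from the background pass, human/overlap
    #    pixels from the human pass, everything else from the original image.
    return np.where(background_mask[..., None], background_result,
                    np.where(human_region[..., None], human_result, image))

# Placeholder inpainting functions standing in for the two generative models.
human_inpaint = lambda img, mask: img
background_inpaint = lambda img, mask: img

image = np.random.rand(256, 512, 3).astype(np.float32)
human_mask = np.zeros((256, 512), dtype=bool)
overlap_mask = np.zeros((256, 512), dtype=bool)
background_mask = np.ones((256, 512), dtype=bool)
result = composite_repair(image, human_mask, overlap_mask, background_mask,
                          human_inpaint, background_inpaint)
```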
As shown, FIG. 64 illustrates details of the scene-based image editing system 106 training a human repair GAN in accordance with one or more embodiments. For example, fig. 64 shows the scene-based image editing system 106 training a human repair GAN 6408 utilizing a partial reconstruction loss 6420 and/or an adversarial loss 6414. As previously described, it is difficult to generate training data for repairing occlusions due to the lack of ground truth information under the occlusion. In one or more embodiments, the scene-based image editing system 106 solves this problem by utilizing a partial reconstruction loss that determines a loss metric for portions of the digital image outside of the hole/occlusion to improve the accuracy of the network. In other words, in some implementations, the scene-based image editing system 106 does not utilize a reconstruction loss on the occluding object, such that the human repair GAN 6408 does not generate the occluding object.
For example, fig. 64 shows the scene-based image editing system 106 utilizing an encoder 6406 with a structure guide map 6402 and a digital image 6404. Further, with human repair GAN 6408, scene-based image editing system 106 generates modified digital image 6410. The scene-based image editing system 106 then compares the modified digital image 6410 with the digital image 6416 with holes to determine the partial reconstruction loss 6420.
Reconstruction loss is a measure of similarity or fidelity between the digital image and the generated digital image. For example, the reconstruction loss includes the proximity of the decoder output relative to the original input. In one or more implementations, the scene-based image editing system 106 utilizes a loss function, such as a mean square error or other loss metric, to determine reconstruction losses.
In one or more embodiments, the scene-based image editing system 106 determines the partial reconstruction loss 6420 by focusing on particular portions of the digital image. For example, in one or more implementations, the scene-based image editing system 106 determines the partial reconstruction loss 6420 over regions of the digital image 6404 outside of the hole/occlusion. For example, the scene-based image editing system 106, utilizing the human repair GAN 6408, repairs a portion of the digital image 6404. In so doing, the scene-based image editing system 106 determines a reconstruction loss between the repaired portion and the corresponding portion of the digital image 6404. Thus, as shown, the scene-based image editing system 106 compares the modified digital image 6410 with the digital image 6416 with holes to generate the partial reconstruction loss 6420.
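A minimal sketch of a partial reconstruction loss that only penalizes differences outside the hole/occlusion region is shown below; the choice of L1 distance and the mask convention (1 = known pixel, 0 = hole) are assumptions, not the system's specified loss function.

```python
import torch

def partial_reconstruction_loss(generated, target, known_mask):
    """Reconstruction loss computed only over pixels outside the hole.

    known_mask is 1 for pixels with ground truth (outside the occlusion)
    and 0 inside the hole, so occluded pixels contribute no loss.
    """
    known = known_mask.expand_as(generated)
    diff = (generated - target).abs() * known
    return diff.sum() / known.sum().clamp(min=1.0)

generated = torch.rand(1, 3, 256, 512)
target = torch.rand(1, 3, 256, 512)
known_mask = torch.ones(1, 1, 256, 512)
known_mask[..., 100:200, 200:300] = 0.0   # hole region excluded from the loss
loss = partial_reconstruction_loss(generated, target, known_mask)
```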
In determining the partial reconstruction loss 6420, the scene-based image editing system 106 utilizes the partial reconstruction loss to train the human repair GAN 6408. For example, the scene-based image editing system 106 modifies parameters of the human repair GAN 6408 to reduce partial reconstruction losses 6420. In one or more implementations, the scene-based image editing system 106 propagates the partial reconstruction loss 6420 back to the human repair GAN 6408.
As further shown, scene-based image editing system 106 also determines an adversarial loss 6414. As previously described, the generative model attempts to generate realistic digital images while the discriminator 6412 attempts to distinguish generated digital images from real digital images. For example, the scene-based image editing system 106 determines the adversarial loss of the generative model. In particular, as described above, the adversarial loss arises from the generative model and the discriminator 6412 attempting to fool each other in a zero-sum game. As shown in fig. 64, the scene-based image editing system 106 back propagates the adversarial loss 6414 to the human repair GAN 6408. Thus, the scene-based image editing system 106 modifies the parameters of the human repair GAN 6408 based on the adversarial loss 6414 and the partial reconstruction loss 6420.
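Under the assumption of a standard non-saturating GAN objective, the combined training objective might be sketched as a weighted sum of the adversarial loss and the partial reconstruction loss; the specific adversarial formulation and the weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def generator_loss(discriminator_logits_fake, generated, target, known_mask,
                   recon_weight=10.0):
    """Total generator loss: adversarial term plus weighted partial
    reconstruction term (computed only outside the hole)."""
    # Non-saturating adversarial loss: push fake logits toward "real".
    adv = F.softplus(-discriminator_logits_fake).mean()
    known = known_mask.expand_as(generated)
    recon = ((generated - target).abs() * known).sum() / known.sum().clamp(min=1.0)
    return adv + recon_weight * recon

logits_fake = torch.randn(1)
generated = torch.rand(1, 3, 256, 512)
target = torch.rand(1, 3, 256, 512)
known_mask = torch.ones(1, 1, 256, 512)
total = generator_loss(logits_fake, generated, target, known_mask)
```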
In some embodiments, scene-based image editing system 106 detects people and obstructions/occlusions within the digital image and provides recommendations. For example, the scene-based image editing system 106 provides various recommendations to remove detected obstructions/occlusions, and provides the user of the client device with the option of selecting a provided recommendation. In particular, scene-based image editing system 106 may also provide sample generations of the digital image with various objects removed and the person and/or background repaired.
Further, in some embodiments, the scene-based image editing system 106 intelligently detects the upload of digital images depicting people, in addition to an existing digital image being edited within the image editing application. In particular, the scene-based image editing system 106 receives an upload of a digital image depicting a person and automatically removes objects (e.g., objects that obscure or obstruct the person) and performs repair of the person. Further, in this case, the scene-based image editing system 106 may also perform background repair to conform to the already existing digital image. Further, in response to receiving the additionally uploaded digital image, scene-based image editing system 106 performs human repair on humans depicted within the existing digital image being edited within the image editing application.
Turning to fig. 65, additional details will now be provided regarding the various components and capabilities of the scene-based image editing system 106. In particular, FIG. 65 illustrates the scene-based image editing system 106 implemented by a computing device 6500 (e.g., server 102 and/or one of client devices 110a-110n). In addition, scene-based image editing system 106 is also part of image editing system 104. As shown, in one or more embodiments, scene-based image editing system 106 includes, but is not limited to, neural network application manager 6502, semantic scene graph generator 6504, semantic history log generator 6506, image modification engine 6508, user interface manager 6510, and data store 6512 (which includes neural network 6514, image analysis graph 6516, real world class description graph 6518, behavior policy graph 6520, and semantic history log 6522).
As just described, and as shown in fig. 65, the scene-based image editing system 106 includes a neural network application manager 6502. In one or more embodiments, the neural network application manager 6502 implements one or more neural networks for editing digital images, such as a segmentation neural network, a repair neural network, a shadow detection neural network, an attribute classification neural network, or various other machine learning models for editing digital images. In some cases, the neural network application manager 6502 automatically implements one or more neural networks without user input. For example, in some cases, the neural network application manager 6502 utilizes one or more neural networks to pre-process the digital image prior to receiving user input to edit the digital image. Thus, in some cases, the neural network application manager 6502 implements one or more neural networks in anticipation of modifying the digital image.
In addition, as shown in fig. 65, scene-based image editing system 106 includes semantic scene graph generator 6504. In one or more embodiments, semantic scene graph generator 6504 generates a semantic scene graph of a digital image. For example, in some cases, the scene-based image editing system 106 utilizes information about the digital image collected via one or more neural networks (e.g., implemented by the neural network application manager 6502) and generates a semantic scene graph of the digital image. In some cases, semantic scene graph generator 6504 automatically generates a semantic scene graph of a digital image without user input (e.g., desired modification of the digital image). In one or more embodiments, the semantic scene graph generator 6504 uses the image analysis graph, the real world class description graph, and/or the behavioral policy graph to generate a semantic scene graph of the digital image.
Further, as shown in fig. 65, the scene-based image editing system 106 includes a semantic history log generator 6506. In one or more embodiments, the semantic history log generator 6506 generates a semantic history log of digital images. In particular, the semantic history log generator 6506 tracks semantic states of digital images and generates semantic history logs having representations (e.g., visual representations) of those semantic states.
As shown in fig. 65, the scene based image editing system 106 also includes an image modification engine 6508. In one or more embodiments, the image modification engine 6508 modifies digital images. For example, in some cases, the image modification engine 6508 modifies the digital image by modifying one or more objects depicted in the digital image. For example, in some cases, the image modification engine 6508 deletes objects from or moves objects within the digital image. In some implementations, the image modification engine 6508 modifies one or more attributes of the object. In some embodiments, the image modification engine 6508 modifies an object in the digital image based on a relationship between the object in the digital image and another object.
Further, as shown in fig. 65, the scene-based image editing system 106 includes a user interface manager 6510. In one or more embodiments, the user interface manager 6510 manages a graphical user interface of a client device. For example, in some cases, the user interface manager 6510 detects and interprets user interactions with a graphical user interface (e.g., detects selection of an object depicted in a digital image). In some embodiments, the user interface manager 6510 also provides visual elements for display within a graphical user interface, such as visual indications of object selection, interactive windows displaying object properties, and/or user interactions for modifying objects.
In addition, as shown in fig. 65, the scene-based image editing system 106 includes a data store 6512. In particular, the data store 6512 includes a neural network 6514, an image analysis graph 6516, a real world class description graph 6518, a behavior policy graph 6520, and a semantic history log 6522.
Each component 6502-6522 of the scene-based image editing system 106 optionally includes software, hardware, or both. For example, components 6502-6522 include one or more instructions stored on a computer-readable storage medium and executable by a processor of one or more computing devices, such as a client device or a server device. The computer-executable instructions of the scene-based image editing system 106, when executed by one or more processors, cause a computing device to perform the methods described herein. Alternatively, the components 6502-6522 comprise hardware, such as special purpose processing devices that perform a specific function or group of functions. Alternatively, the components 6502-6522 of the scene-based image editing system 106 comprise a combination of computer executable instructions and hardware.
Further, the components 6502-6522 of the scene-based image editing system 106 may be implemented, for example, as one or more operating systems, one or more stand-alone applications, one or more modules of applications, one or more plug-ins, one or more library functions or functions that may be invoked by other applications, and/or a cloud computing model. Thus, the components 6502-6522 of the scene-based image editing system 106 may be implemented as stand-alone applications, such as desktop or mobile applications. Further, the components 6502-6522 of the scene-based image editing system 106 may be implemented as one or more web-based applications hosted on a remote server. Alternatively or additionally, the components 6502-6522 of the scene-based image editing system 106 may be implemented in a set of mobile device applications or "application programs". For example, in one or more embodiments, the scene-based image editing system 106 includes or operates in connection with digital software applications such as ADOBE STARDUST. The foregoing are registered trademarks or trademarks of Adobe Inc. in the United States and/or other countries.
FIGS. 1-65, the corresponding text, and the examples provide a number of different methods, systems, devices, and non-transitory computer-readable media of the scene-based image editing system 106. In addition to the foregoing, one or more embodiments may be described in terms of flow diagrams comprising acts for accomplishing a particular result, as illustrated in FIGS. 66-67. The series of acts in FIGS. 66-67 may be performed with more or fewer acts. Moreover, the acts may be performed in a different order. Additionally, the acts described herein may be repeated or performed in parallel with each other or in parallel with different instances of the same or similar acts.
FIG. 66 illustrates a flow diagram of a series of acts 6600 for implementing perspective-aware object-movement operations on digital images in accordance with one or more embodiments. While FIG. 66 illustrates acts in accordance with one embodiment, alternative embodiments may omit, add to, reorder, and/or modify any of the acts shown in FIG. 66. In some implementations, the acts of FIG. 66 are performed as part of a method. For example, in some embodiments, the acts of FIG. 66 are performed as part of a computer-implemented method. Alternatively, a non-transitory computer-readable medium may store instructions thereon that, when executed by at least one processor, cause the at least one processor to perform operations comprising the acts of FIG. 66. In some embodiments, a system performs the acts of FIG. 66. For example, in one or more embodiments, the system includes at least one memory device including a depth prediction neural network. The system also includes at least one processor configured to cause the system to perform the acts of FIG. 66.
The series of actions 6600 includes an action 6602 of determining a vanishing point associated with a digital image of a drawing object. For example, in some implementations, act 6602 involves analyzing a digital image and identifying vanishing points for the digital image based on the analysis.
The series of actions 6600 also includes an action 6604 for detecting one or more user interactions for moving an object within the digital image. To illustrate, in some embodiments, act 6604 involves detecting user selection of an object and further user interaction for moving the object.
The series of actions 6600 also includes an action 6606 for resizing the object based on moving the object relative to the vanishing point. For example, in one or more embodiments, act 6606 involves performing perspective-based resizing of objects within the digital image based on moving the objects relative to the vanishing point.
As shown in fig. 66, act 6606 includes a sub-act 6608 for moving the object toward the vanishing point and reducing the size of the object. For example, in one or more embodiments, moving the object relative to the vanishing point within the digital image includes moving the object toward the vanishing point on a line from the object to the vanishing point; and performing perspective-based resizing of the object based on moving the object relative to the vanishing point includes reducing the size of the object based on moving the object toward the vanishing point.
As shown in fig. 66, act 6606 also includes a sub-act 6610 of moving the object away from the vanishing point and increasing the size of the object. For example, in some embodiments, moving the object within the digital image relative to the vanishing point includes moving the object away from the vanishing point on a line from the object to the vanishing point; and performing perspective-based resizing of the object based on moving the object relative to the vanishing point includes increasing a size of the object based on moving the object away from the vanishing point. Thus, in some cases, the scene-based image editing system 106 changes the size of the object based on whether the object is moving toward or away from the vanishing point.
In one or more embodiments, scene-based image editing system 106 generates perspective scaling of objects based on moving the objects within the digital image. Thus, in some cases, performing perspective-based resizing of the object within the digital image includes performing perspective-based resizing of the object using a perspective scaling ratio of the object. In some implementations, generating a perspective scale of the object based on moving the object within the digital image includes: determining a first distance of the object from a horizontal line associated with the digital image prior to moving the object; determining a second distance of the object from the horizon after the object moves; and generating the perspective scaling ratio using the first distance and the second distance.
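For illustration, one plausible implementation of the perspective scaling ratio under the description above (object size scaled by the ratio of the object's distances to the horizon line before and after the move) is sketched below; the exact formula used by the system is not specified here, so this is an assumption consistent with the described behavior of shrinking toward the vanishing point.

```python
def perspective_scale(dist_before, dist_after):
    """Scaling ratio from the object's distance to the horizon line before
    and after the move; moving toward the horizon (smaller distance) shrinks
    the object, moving away from it enlarges the object."""
    if dist_before <= 0:
        raise ValueError("the object must start at a nonzero distance from the horizon line")
    return dist_after / dist_before

def resize_for_move(width, height, dist_before, dist_after):
    scale = perspective_scale(dist_before, dist_after)
    return width * scale, height * scale

# Example: an object moves from 200 px below the horizon to 100 px below it,
# i.e., toward the vanishing point, so it is scaled down by half.
new_w, new_h = resize_for_move(80, 120, dist_before=200, dist_after=100)
```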
In some embodiments, series of actions 6600 further includes actions for occluding objects within the digital image based on object movement. For example, in some cases, moving the object within the digital image relative to the vanishing point includes moving the object to a position where a portion of the object overlaps a portion of an additional object depicted within the digital image. Thus, in some cases, the scene-based image editing system 106 modifies the digital image to obscure portions of the object or portions of the additional object based on a comparison of the object depth of the object and the object depth of the additional object. For example, in some implementations, the scene-based image editing system 106 determines that the object's object depth is greater than the object's object depth based on comparing the object's object depth to the object depth of the additional object. Thus, in some embodiments, modifying the digital image to occlude the portion of the object or the portion of the additional object comprises: based on determining that the object depth of the object is greater than the object depth of the additional object, the portion of the additional object is used to modify the digital image to occlude the portion of the object.
In some cases, series of actions 6600 further includes actions for providing additional features related to moving the object relative to the vanishing point. For example, in one or more embodiments, the actions include providing one or more perspective-based dimensional previews for display within the digital image, the previews along a line from the object to a vanishing point when the object is selected for movement within the digital image. In some cases, the actions include generating a content fill of the object within the digital image prior to detecting one or more user interactions for moving the object; and exposing content fills within the digital image in response to moving the object relative to the vanishing point.
To provide an illustration, in one or more embodiments, the scene-based image editing system 106 determines a vanishing point associated with a digital image depicting an object at a first location within the digital image; moving the object to a second position within the digital image relative to the vanishing point in response to one or more user interactions with the object; determining a perspective scale of the object based on the first location and the second location; and performing perspective-based resizing of the object within the digital image using the perspective scaling ratio.
In some cases, determining the perspective scale for the object based on the first location and the second location includes: detecting a horizontal line associated with the digital image; determining a first distance from the first location to the horizontal line and a second distance from the second location to the horizontal line; and determining the perspective scaling ratio using the first distance and the second distance. In some cases, detecting the horizontal line of the digital image includes detecting the horizontal line using a neural network. In some embodiments, determining the vanishing point associated with the digital image includes determining the vanishing point to be located outside of the digital image.
In one or more embodiments, the scene-based image editing system 106 also generates a plurality of perspective-based dimensional previews of the object that indicate the size of the object at locations within the digital image other than the first location; detects a selection of the object for movement within the digital image; and, responsive to the selection, provides the plurality of perspective-based dimensional previews along a line from the object to the vanishing point for display within the digital image. In some cases, generating the plurality of perspective-based dimensional previews of the object that indicate the size of the object at locations within the digital image other than the first location includes: generating a first perspective-based dimensional preview indicating a size greater than the size of the object at the first location, the greater size corresponding to a location farther from the vanishing point than the first location; and generating a second perspective-based dimensional preview indicating a size smaller than the size of the object at the first location, the smaller size corresponding to a location closer to the vanishing point than the first location.
In some cases, scene-based image editing system 106 moves vanishing points associated with the digital image to the modified location in response to at least one user interaction with the vanishing points. Thus, in some embodiments, moving the object to a second position within the digital image relative to the vanishing point comprises moving the object to a second position within the digital image relative to the vanishing point at the modified position.
To provide another illustration, in one or more embodiments, the scene-based image editing system 106 detects one or more user interactions for moving a first object depicted within the digital image relative to a vanishing point associated with the digital image; and modifying the digital image in response to the one or more user interactions with the first object by: moving the first object within the digital image relative to the vanishing point from a first position to a second position, a portion of the first object at the second position overlapping a portion of a second object depicted within the digital image; performing perspective-based resizing of the first object within the digital image based on moving the first object from the first position to the second position; and based on the object depth of the first object and the object depth of the second object, occluding the portion of the first object with the portion of the second object.
In some embodiments, determining the object depth of the first object at the second location using the depth prediction neural network comprises determining a first average object depth of the first object at the second location using the depth prediction neural network; and occluding the portion of the first object with the portion of the second object based on the object depth of the first object and the object depth of the second object includes occluding the portion of the first object with the portion of the second object based on the first average object depth of the first object and a second average object depth of the second object. For example, in some cases, occluding the portion of the first object with the portion of the second object based on the first average object depth of the first object and the second average object depth of the second object includes occluding the portion of the first object with the portion of the second object based on determining that the first average object depth is greater than the second average object depth.
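A sketch of this depth-based occlusion decision is shown below, under the assumption that a predicted depth map and binary object masks are available; a larger average depth is treated as farther from the camera and therefore occluded. The helper names are hypothetical, not components of the system.

```python
import numpy as np

def average_object_depth(depth_map, object_mask):
    """Mean predicted depth over the pixels belonging to the object."""
    return float(depth_map[object_mask].mean())

def occlude_overlap(depth_map, mask_a, mask_b):
    """Return which object's pixels should be hidden in the overlap region:
    the object with the greater average depth (farther away) is occluded."""
    overlap = mask_a & mask_b
    if not overlap.any():
        return None
    depth_a = average_object_depth(depth_map, mask_a)
    depth_b = average_object_depth(depth_map, mask_b)
    return "a" if depth_a > depth_b else "b"

depth_map = np.random.rand(256, 512).astype(np.float32)
mask_a = np.zeros((256, 512), dtype=bool)
mask_a[50:150, 100:250] = True
mask_b = np.zeros((256, 512), dtype=bool)
mask_b[100:200, 200:350] = True
occluded = occlude_overlap(depth_map, mask_a, mask_b)
```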
In some embodiments, the scene-based image editing system 106 also moves the first object from a second position within the digital image to a third position relative to the vanishing point, an additional portion of the first object at the third position overlapping with an additional portion of the second object; determining an additional object depth of the first object at the third location using the depth prediction neural network; and masking the additional portion of the second object with the additional portion of the first object based on the additional object depth of the first object and the object depth of the second object.
FIG. 67 illustrates a flow diagram of a series of acts 6700 for implementing depth-aware object movement operations on digital images in accordance with one or more embodiments. While FIG. 67 illustrates acts in accordance with one embodiment, alternative embodiments may omit, add to, reorder, and/or modify any of the acts shown in FIG. 67. In some implementations, the acts of FIG. 67 are performed as part of a method. For example, in some embodiments, the acts of FIG. 67 are performed as part of a computer-implemented method. Alternatively, a non-transitory computer-readable medium may store instructions thereon that, when executed by at least one processor, cause the at least one processor to perform operations comprising the acts of FIG. 67. In some embodiments, a system performs the acts of FIG. 67. For example, in one or more embodiments, the system includes at least one memory device including a depth prediction neural network. The system also includes at least one processor configured to cause the system to perform the acts of FIG. 67.
The series of acts 6700 includes an act 6702 for determining object depths of a first object and a second object depicted in a digital image. For example, in some embodiments, act 6702 involves determining a first object depth for a first object depicted within the digital image and a second object depth for a second object depicted within the digital image.
In one or more embodiments, determining the first object depth of the first object includes determining a first average object depth of the first object, and determining the second object depth of the second object includes determining a second average object depth of the second object. In some cases, determining the first object depth and the second object depth includes determining the first object depth and the second object depth using a depth prediction neural network.
In some embodiments, determining the first object depth and the second object depth includes preprocessing the digital image, without user input, to determine the first object depth and the second object depth. Furthermore, in some cases, the scene-based image editing system 106 adds the first object depth and the second object depth to a semantic scene graph associated with the digital image. Thus, in some cases, the scene-based image editing system 106 compares the first object depth and the second object depth by referencing the semantic scene graph in response to moving the first object to create an overlapping region between the first object and the second object.
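A minimal sketch of how the cached depths might be consulted at move time follows, using a plain dictionary as a hypothetical stand-in for the semantic scene graph; the object names and depth values are illustrative only.

```python
# Preprocessing: store each object's average depth in the scene graph.
scene_graph = {
    "objects": {
        "person": {"avg_depth": 2.3},
        "bench":  {"avg_depth": 5.1},
    }
}


def occluder_on_overlap(scene_graph, moved_obj, other_obj):
    """On overlap, consult the cached depths instead of re-running the depth model."""
    objects = scene_graph["objects"]
    moved = objects[moved_obj]["avg_depth"]
    other = objects[other_obj]["avg_depth"]
    return moved_obj if moved <= other else other_obj


# The user drags the person over the bench; the cached depths say the person
# (smaller depth, closer to the camera) occludes the bench in the overlap.
print(occluder_on_overlap(scene_graph, "person", "bench"))  # -> person
```

Because the depths are computed during preprocessing, the drag interaction only pays for a dictionary lookup rather than another forward pass through the depth model.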
The series of acts 6700 also includes an act 6704 for moving the first object to create an overlapping region with the second object. For example, in some cases, act 6704 involves moving the first object to create an overlap region between the first object and the second object within the digital image.
The series of acts 6700 further includes an act 6706 for modifying the digital image to occlude one object in the overlapping region based on the object depth. To illustrate, in some cases, act 6706 involves modifying the digital image to occlude the first object or the second object within the overlapping region based on the first object depth and the second object depth.
As shown in FIG. 67, act 6706 includes a sub-act 6708 for occluding the second object with the first object based on the object depth of the first object being smaller. For example, in some cases, the scene-based image editing system 106 determines that the first object depth of the first object is less than the second object depth of the second object. Thus, in some embodiments, modifying the digital image to occlude the first object or the second object within the overlapping region based on the first object depth and the second object depth comprises modifying the digital image to occlude the second object with the first object within the overlapping region based on determining that the first object depth is less than the second object depth. In some implementations, the scene-based image editing system 106 also moves the second object outside of the overlapping region within the digital image; and modifies the digital image to expose a portion of the second object that was occluded by the first object.
As shown in FIG. 67, act 6706 further includes a sub-act 6710 for occluding the first object with the second object based on the object depth of the first object being greater. To illustrate, in some cases, the scene-based image editing system 106 determines that the first object depth of the first object is greater than the second object depth of the second object. Thus, in some embodiments, modifying the digital image to occlude the first object or the second object within the overlapping region based on the first object depth and the second object depth comprises modifying the digital image to occlude the first object with the second object within the overlapping region based on determining that the first object depth is greater than the second object depth.
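A minimal sketch of applying the occlusion within the overlapping region follows. It assumes each object is available as a color layer plus a boolean mask over an already-completed background (layer extraction and hole filling are not shown); compositing the deeper object first lets the closer object's pixels win wherever the masks overlap.

```python
import numpy as np


def composite_over(base_rgb, layer_rgb, layer_mask):
    """Paste a layer's pixels onto the base wherever its mask is set."""
    out = base_rgb.copy()
    out[layer_mask] = layer_rgb[layer_mask]
    return out


# Toy 4x4 scene: the "near" object (smaller depth) overlaps the "far" object in
# column 2, so compositing far-then-near leaves the near object's pixels there.
background = np.zeros((4, 4, 3), np.uint8)
far_rgb = np.full((4, 4, 3), 100, np.uint8)
far_mask = np.array([[0, 1, 1, 0]] * 4, bool)
near_rgb = np.full((4, 4, 3), 200, np.uint8)
near_mask = np.array([[0, 0, 1, 1]] * 4, bool)

image = composite_over(background, far_rgb, far_mask)   # deeper object first
image = composite_over(image, near_rgb, near_mask)      # closer object occludes the overlap
print(image[0])  # row 0: background, far only, near over far, near only
```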
To provide an illustration, in one or more embodiments, the scene-based image editing system 106 determines a first object depth of a first object depicted within the digital image and a second object depth of a second object depicted within the digital image; moves the first object within the digital image such that a portion of the first object overlaps a portion of the second object; compares the first object depth and the second object depth to identify an occluding object from the first object or the second object based on moving the first object; and modifies the digital image using the occluding object to occlude the portion of the first object or the portion of the second object.
In one or more embodiments, comparing the first object depth to the second object depth to identify the occluding object from the first object or the second object includes comparing the first object depth and the second object depth to determine the object having the smaller object depth; and modifying the digital image using the occluding object to occlude the portion of the first object or the portion of the second object includes modifying the digital image using the object having the smaller object depth to occlude the portion of the first object or the portion of the second object.
In some embodiments, the scene-based image editing system 106 also determines a third object depth for a third object depicted within the digital image; moves the third object within the digital image such that a portion of the third object overlaps the first object and the second object; and compares the first object depth, the second object depth, and the third object depth to identify a first occluding object and a second occluding object from the first object, the second object, and the third object based on moving the third object. In some embodiments, comparing the first object depth, the second object depth, and the third object depth to identify the first occluding object and the second occluding object comprises: determining that the first object is the first occluding object based on determining that the first object depth is less than the second object depth and the third object depth; and determining that the third object is the second occluding object based on determining that the third object depth is greater than the first object depth and less than the second object depth. In some cases, the scene-based image editing system 106 further modifies the digital image by: occluding the second object with the portion of the third object based on determining that the third object is the second occluding object; and occluding the portion of the third object with the first object based on determining that the first object is the first occluding object.
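For three or more overlapping objects, the pairwise comparison generalizes to a back-to-front ordering: compositing proceeds from the greatest average depth to the smallest, so each closer object occludes the overlapping portions of every deeper object behind it. A minimal sketch of that ordering follows, with illustrative depth values only.

```python
def occlusion_order(object_depths):
    """Back-to-front compositing order: deepest object first, closest object last.

    `object_depths` maps an object name to its average depth; later entries in
    the returned list occlude earlier ones wherever they overlap.
    """
    return sorted(object_depths, key=object_depths.get, reverse=True)


# Illustrative depths matching the three-object example: the first object is
# closest (it occludes everything it overlaps) and the third sits in between.
depths = {"first": 1.2, "second": 6.8, "third": 3.5}
print(occlusion_order(depths))  # -> ['second', 'third', 'first']
```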
In some cases, the scene-based image editing system 106 also generates a content fill for the first object before detecting the one or more user interactions for moving the first object; generates a completed background for the digital image using the content fill for the first object; and exposes the content fill for the first object via the completed background as the first object is moved within the digital image. In some cases, the scene-based image editing system 106 also generates a semantic scene graph for the digital image, the semantic scene graph including the first object depth of the first object and the second object depth of the second object; and references the semantic scene graph to retrieve the first object depth and the second object depth upon determining that moving the first object causes the portion of the first object to overlap the portion of the second object. In some implementations, modifying the digital image to occlude the portion of the first object or the portion of the second object using the occluding object includes modifying the digital image in real time, as the portion of the first object begins to overlap the portion of the second object, to reflect the occlusion of pixels of the first object or pixels of the second object by the occluding object.
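A minimal sketch of the content-fill behavior follows: a completed background is prepared before any move, so dragging the object simply re-composites it at the new position over that background and exposes the fill at the old position. The inpaint_region helper is a hypothetical placeholder for whatever model generates the content fill; it is not an API of the disclosed system.

```python
import numpy as np


def inpaint_region(image_rgb, hole_mask):
    """Hypothetical stand-in for a content-fill model: here, just a constant fill."""
    filled = image_rgb.copy()
    filled[hole_mask] = 128
    return filled


def move_object(image_rgb, obj_rgb, obj_mask, dx, dy):
    """Precompute the completed background, then paste the object at its new position."""
    background = inpaint_region(image_rgb, obj_mask)          # done before the drag
    moved_mask = np.roll(obj_mask, (dy, dx), axis=(0, 1))     # new object footprint
    moved_rgb = np.roll(obj_rgb, (dy, dx), axis=(0, 1))
    out = background.copy()
    out[moved_mask] = moved_rgb[moved_mask]                   # object at new position
    return out                                                # old position shows the fill


# Toy 6x6 image with a 2x2 object moved two pixels to the right.
img = np.zeros((6, 6, 3), np.uint8)
obj_mask = np.zeros((6, 6), bool)
obj_mask[2:4, 1:3] = True
obj_rgb = np.zeros((6, 6, 3), np.uint8)
obj_rgb[obj_mask] = 255
print(move_object(img, obj_rgb, obj_mask, dx=2, dy=0)[2])  # fill at cols 1-2, object at cols 3-4
```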
To provide another illustration, in one or more embodiments, the scene-based image editing system 106 pre-processes the digital image using a depth prediction neural network to determine a first average object depth for a first object depicted in the digital image and a second average object depth for a second object depicted in the digital image; detects one or more user interactions for moving the first object to overlap the second object; and modifies the digital image in response to the one or more user interactions by: moving the first object within the digital image to create an overlap region between the first object and the second object; determining that the first average object depth of the first object is less than the second average object depth of the second object based on moving the first object to overlap the second object; and occluding the second object with the first object within the overlap region based on determining that the first average object depth is less than the second average object depth.
In one or more embodiments, moving the first object within the digital image to create the overlap region between the first object and the second object includes moving the first object within the digital image such that an entirety of the second object overlaps the first object; and occluding the second object with the first object includes occluding the entirety of the second object with the first object. In some cases, the scene-based image editing system 106 also modifies the position of the first object or the position of the second object to create a subsequent overlap region between the first object and the second object; and, in response to creating the subsequent overlap region, occludes the second object with the first object within the subsequent overlap region. In some cases, the scene-based image editing system 106 also moves the first object within the digital image to create an additional overlap region between the first object and a third object depicted in the digital image; determines that the first average object depth of the first object is greater than a third average object depth of the third object; and, based on determining that the first average object depth is greater than the third average object depth, occludes the first object with the third object within the additional overlap region.
Embodiments of the present disclosure may include or utilize a special purpose or general-purpose computer including computer hardware, such as one or more processors and system memory, as discussed in more detail below. Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. In particular, one or more processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein). In general, a processor (e.g., a microprocessor) receives instructions from a non-transitory computer-readable medium (e.g., memory) and executes the instructions, thereby performing one or more processes, including one or more of the processes described herein.
Computer readable media can be any available media that can be accessed by a general purpose or special purpose computer system. The computer-readable medium storing computer-executable instructions is a non-transitory computer-readable storage medium (device). The computer-readable medium carrying computer-executable instructions is a transmission medium. Thus, by way of example, and not limitation, embodiments of the disclosure may include at least two distinct types of computer-readable media: a non-transitory computer readable storage medium (device) and a transmission medium.
Non-transitory computer readable storage media (devices) include RAM, ROM, EEPROM, CD-ROM, solid state drives ("SSDs") (e.g., based on RAM), flash memory, phase change memory ("PCM"), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired program code means in the form of computer-executable instructions or data structures and that can be accessed by a general purpose or special purpose computer.
A "network" is defined as one or more data links capable of transmitting electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. The transmission media can include networks and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures, and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.
Furthermore, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to non-transitory computer-readable storage media (devices) (or vice versa). For example, computer-executable instructions or data structures received over a network or data link may be cached in RAM within a network interface module (e.g., a "NIC") and then ultimately transferred to computer system RAM and/or less volatile computer storage media (devices) at the computer system. Thus, it should be understood that a non-transitory computer readable storage medium (device) can be included in a computer system component that also (or even primarily) utilizes transmission media.
Computer-executable instructions comprise, for example, instructions and data which, when executed by a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. In some embodiments, computer-executable instructions execute on a general purpose computer to transform the general purpose computer into a special purpose computer that implements the elements of the present disclosure. The computer-executable instructions may be, for example, binary code, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.
Those skilled in the art will appreciate that the invention may be practiced in network computing environments with many types of computer system configurations, including personal computers, desktop computers, laptop computers, message processors, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablet computers, pagers, routers, switches, and the like. The disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.
Embodiments of the present disclosure may also be implemented in a cloud computing environment. In this specification, "cloud computing" is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing may be employed in the marketplace to provide ubiquitous and convenient on-demand access to a shared pool of configurable computing resources. The shared pool of configurable computing resources may be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly.
A cloud computing model may be composed of various characteristics, such as on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud computing model may also expose various service models, such as software as a service ("SaaS"), platform as a service ("PaaS"), and infrastructure as a service ("IaaS"). A cloud computing model may also be deployed using different deployment models, such as private cloud, community cloud, public cloud, hybrid cloud, and so forth. In this specification and in the claims, a "cloud computing environment" is an environment in which cloud computing is employed.
FIG. 68 illustrates a block diagram of an example computing device 6800 that can be configured to perform one or more of the processes described above. It will be appreciated that one or more computing devices, such as computing device 6800, may represent the computing devices described above (e.g., server 102 and/or client devices 110a-110n). In one or more embodiments, the computing device 6800 can be a mobile device (e.g., a mobile phone, smart phone, PDA, tablet, laptop, camera, tracker, watch, wearable device). In some embodiments, computing device 6800 can be a non-mobile device (e.g., a desktop computer or another type of client device). Further, computing device 6800 can be a server device that includes cloud-based processing and storage capabilities.
As shown in FIG. 68, computing device 6800 may include one or more processors 6802, memory 6804, a storage device 6806, input/output interfaces 6808 (or "I/O interfaces 6808"), and a communication interface 6810, which may be communicatively coupled via a communication infrastructure (e.g., bus 6812). While computing device 6800 is shown in FIG. 68, the components illustrated in FIG. 68 are not intended to be limiting. Additional or alternative components may be used in other embodiments. Moreover, in certain embodiments, computing device 6800 includes fewer components than shown in FIG. 68. The components of computing device 6800 shown in FIG. 68 will now be described in more detail.
In particular embodiments, the processor 6802 includes hardware for executing instructions, such as those making up a computer program. As an example and not by way of limitation, to execute instructions, the processor 6802 may retrieve (or fetch) the instructions from an internal register, an internal cache, the memory 6804, or the storage device 6806, and decode and execute them.
The computing device 6800 includes a memory 6804 coupled to the processor 6802. The memory 6804 may be used for storing data, metadata, and programs executed by the processor. Memory 6804 may include one or more of volatile and non-volatile memory, such as random access memory ("RAM"), read only memory ("ROM"), solid state disk ("SSD"), flash memory, phase change memory ("PCM"), or other types of data storage devices. The memory 6804 may be internal or distributed memory.
Computing device 6800 includes a storage device 6806 that includes storage for storing data or instructions. By way of example, and not limitation, the storage device 6806 may comprise the non-transitory storage media described above. The storage device 6806 may include a Hard Disk Drive (HDD), flash memory, a Universal Serial Bus (USB) drive, or a combination of these or other storage devices.
As shown, computing device 6800 includes one or more I/O interfaces 6808 that are provided to allow a user to provide input (such as user strokes) to computing device 6800, to receive output from computing device 6800, and to otherwise communicate data to computing device 6800 and from computing device 6800. These I/O interfaces 6808 can include a mouse, a keypad or keyboard, a touch screen, a camera, an optical scanner, a network interface, a modem, other known I/O devices, or a combination of such I/O interfaces 6808. The touch screen may be activated with a stylus or finger.
The I/O interface 6808 can include one or more devices for presenting output to a user, including but not limited to a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., a display driver), one or more audio speakers, and one or more audio drivers. In some embodiments, the I/O interface 6808 is configured to provide graphical data to a display for presentation to a user. The graphical data may represent one or more graphical user interfaces and/or any other graphical content that may serve a particular implementation.
Computing device 6800 can also include communication interface 6810. Communication interface 6810 may include hardware, software, or both. Communication interface 6810 provides one or more interfaces for communication (e.g., packet-based communication) between the computing device and one or more other computing devices or one or more networks. By way of example, and not by way of limitation, communication interface 6810 may include a Network Interface Controller (NIC) or network adapter for communicating with an ethernet or other wire-based network, or a Wireless NIC (WNIC) or wireless adapter for communicating with a wireless network such as WI-FI. Computing device 6800 can also include bus 6812. Bus 6812 may include hardware, software, or both that connect the components of computing device 6800 to one another.
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. Various embodiments and aspects of the invention are described with reference to details discussed herein, and the accompanying drawings illustrate the various embodiments. The foregoing description and drawings are illustrative of the invention and are not to be construed as limiting the invention. Numerous specific details are described to provide a thorough understanding of various embodiments of the invention.
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. For example, the methods described herein may be performed with fewer or more steps/acts, or the steps/acts may be performed in a different order. Additionally, the steps/acts described herein may be repeated or performed in parallel with each other or in parallel with different instances of the same or similar steps/acts. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims (20)

1. A computer-implemented method, comprising:
determining a vanishing point associated with a digital image depicting an object;
detecting one or more user interactions for moving the object within the digital image; and
performing perspective-based resizing of the object within the digital image based on moving the object relative to the vanishing point.
2. The computer-implemented method of claim 1, wherein:
moving the object within the digital image relative to the vanishing point includes moving the object toward the vanishing point on a line from the object to the vanishing point; and
performing the perspective-based resizing of the object based on moving the object relative to the vanishing point includes reducing a size of the object based on moving the object toward the vanishing point.
3. The computer-implemented method of claim 1, wherein:
moving the object within the digital image relative to the vanishing point includes moving the object away from the vanishing point on a line from the object to the vanishing point; and
performing the perspective-based resizing of the object based on moving the object relative to the vanishing point includes increasing a size of the object based on moving the object away from the vanishing point.
4. The computer-implemented method of claim 1,
further comprising generating a perspective scale for the object based on moving the object within the digital image,
wherein performing the perspective-based resizing of the object within the digital image comprises performing the perspective-based resizing of the object using the perspective scaling for the object.
5. The computer-implemented method of claim 4, wherein generating the perspective scale for the object based on moving the object within the digital image comprises:
determining a first distance of the object from a horizontal line associated with the digital image before the object is moved;
determining a second distance of the object from the horizontal line after the object is moved; and
generating the perspective scale using the first distance and the second distance.
6. The computer-implemented method of claim 1,
wherein moving the object within the digital image relative to the vanishing point comprises moving the object to a position in which a portion of the object overlaps with a portion of an additional object depicted within the digital image; and
further comprising modifying the digital image to occlude the portion of the object or the portion of the additional object based on comparing an object depth of the object to an object depth of the additional object.
7. The computer-implemented method of claim 6,
further comprising determining that the object depth of the object is greater than the object depth of the additional object based on comparing the object depth of the object with the object depth of the additional object,
wherein modifying the digital image to occlude the portion of the object or the portion of the additional object comprises modifying the digital image to occlude the portion of the object with the portion of the additional object based on determining that the object depth of the object is greater than the object depth of the additional object.
8. The computer-implemented method of claim 1, further comprising providing one or more perspective-based dimensional previews along a line from the object to the vanishing point for display within the digital image upon selection of the object to move within the digital image.
9. The computer-implemented method of claim 1, further comprising:
generating a content fill for the object within the digital image prior to detecting the one or more user interactions for moving the object; and
exposing the content fill within the digital image in response to moving the object relative to the vanishing point.
10. A non-transitory computer-readable medium having instructions stored thereon, which when executed by at least one processor, cause the at least one processor to perform operations comprising:
determining a vanishing point associated with a digital image depicting an object at a first location within the digital image;
moving the object to a second location within the digital image relative to the vanishing point in response to one or more user interactions with the object;
determining a perspective scale for the object based on the first location and the second location; and
performing perspective-based resizing of the object within the digital image using the perspective scale.
11. The non-transitory computer-readable medium of claim 10, wherein determining the perspective scale for the object based on the first location and the second location comprises:
detecting a horizontal line associated with the digital image;
determining a first distance from the first location to the horizontal line and a second distance from the second location to the horizontal line; and
determining the perspective scale using the first distance and the second distance.
12. The non-transitory computer-readable medium of claim 11, wherein detecting the horizontal line for the digital image comprises detecting the horizontal line using a neural network.
13. The non-transitory computer-readable medium of claim 10, wherein determining the vanishing point associated with the digital image comprises determining a vanishing point located outside of the digital image.
14. The non-transitory computer-readable medium of claim 10, wherein the operations further comprise:
generating a plurality of perspective-based dimensional previews for the object, the plurality of perspective-based dimensional previews indicating sizes for the object at locations other than the first location within the digital image;
detecting a selection of movement of the object within the digital image; and
providing, in response to the selection, the plurality of perspective-based dimensional previews along a line from the object to the vanishing point for display within the digital image.
15. The non-transitory computer-readable medium of claim 14, wherein generating the plurality of perspective-based dimensional previews for the object indicating the sizes for the object at the locations within the digital image other than the first location comprises:
generating a first perspective-based dimensional preview indicating a size greater than a size of the object at the first location, the greater size corresponding to a location farther from the vanishing point than the first location; and
generating a second perspective-based dimensional preview indicating a size smaller than the size of the object at the first location, the smaller size corresponding to a location closer to the vanishing point than the first location.
16. The non-transitory computer-readable medium of claim 10, wherein:
the operations further include moving the vanishing point associated with the digital image to a modified location in response to at least one user interaction with the vanishing point; and
moving the object relative to the vanishing point to the second location within the digital image includes moving the object relative to the vanishing point at the modified location to the second location within the digital image.
17. A system, comprising:
at least one memory device comprising a depth prediction neural network; and
at least one processor configured to cause the system to:
detecting one or more user interactions for moving a first object depicted within a digital image relative to a vanishing point associated with the digital image; and
modifying the digital image in response to the one or more user interactions with the first object by:
moving the first object within the digital image from a first position to a second position relative to the vanishing point, a portion of the first object at the second position overlapping a portion of a second object depicted within the digital image;
performing perspective-based resizing of the first object within the digital image based on moving the first object from the first position to the second position; and
occluding the portion of the first object with the portion of the second object based on the second position of the first object and an object depth of the second object.
18. The system of claim 17, wherein:
the at least one processor is further configured to cause the system to determine an object depth of the first object at the second position using the depth prediction neural network; and
occluding the portion of the first object with the portion of the second object based on the second position of the first object and the object depth of the second object comprises occluding the portion of the first object with the portion of the second object based on the object depth of the first object and the object depth of the second object.
19. The system of claim 18, wherein occluding the portion of the first object with the portion of the second object based on the object depth of the first object and the object depth of the second object comprises occluding the portion of the first object with the portion of the second object based on determining that the object depth of the first object is greater than the object depth of the second object.
20. The system of claim 17, wherein the at least one processor is further configured to cause the system to:
moving the first object from the second position to a third position within the digital image relative to the vanishing point, an additional portion of the first object at the third position overlapping with an additional portion of the second object; and
occluding the additional portion of the second object with the additional portion of the first object based on an updated object depth of the first object at the third position and the object depth of the second object.
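As a purely illustrative aside (not part of the claims), the perspective-based dimensional previews recited in claims 8, 14, and 15 can be pictured as sampled sizes along the line from the object toward the vanishing point, each scaled by its distance to the horizontal line relative to the object's current distance. The even spacing and the number of previews below are assumptions, not claim limitations.

```python
def size_previews(obj_y, obj_height, horizon_y, vanishing_y, num_previews=4):
    """Preview heights at evenly spaced positions between the object and the vanishing point.

    Positions nearer the vanishing point (on the horizontal line) get smaller
    previews; positions farther from it get larger ones.
    """
    d_current = abs(obj_y - horizon_y)
    previews = []
    for i in range(1, num_previews + 1):
        t = i / (num_previews + 1)                  # fraction of the way to the vanishing point
        y = obj_y + t * (vanishing_y - obj_y)
        scale = abs(y - horizon_y) / d_current
        previews.append((y, obj_height * scale))
    return previews


# Object base at row 600, horizontal line (and vanishing point) at row 200:
# previews shrink as the sampled positions approach the vanishing point.
for y, h in size_previews(obj_y=600, obj_height=120, horizon_y=200, vanishing_y=200):
    print(round(y), round(h))  # 520 96, 440 72, 360 48, 280 24
```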

Applications Claiming Priority (10)

Application Number Priority Date Filing Date Title
US63/378,616 2022-10-06
US18/058,622 2022-11-23
US18/058,601 2022-11-23
US18/058,575 2022-11-23
US18/058,630 2022-11-23
US18/058,554 2022-11-23
US18/058,538 2022-11-23
US18/190,500 2023-03-27
US202318320664A 2023-05-19 2023-05-19
US18/320,664 2023-05-19

Publications (1)

Publication Number Publication Date
CN117853611A true CN117853611A (en) 2024-04-09

Family

ID=90535274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311286086.4A Pending CN117853611A (en) 2022-10-06 2023-10-07 Modifying digital images via depth aware object movement

Country Status (1)

Country Link
CN (1) CN117853611A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination