WO2023056559A1 - Systems and methods for compositing a virtual object in a digital image - Google Patents

Systems and methods for compositing a virtual object in a digital image

Info

Publication number
WO2023056559A1
Authority
WO
WIPO (PCT)
Prior art keywords
virtual object
ground plane
detected
delineation
digital image
Prior art date
Application number
PCT/CA2022/051479
Other languages
French (fr)
Inventor
Mathieu Garon
Etienne DUBEAU
Mathieu SAINT-DENIS
Original Assignee
Depix Technologies Inc.
Priority date
Filing date
Publication date
Application filed by Depix Technologies Inc. filed Critical Depix Technologies Inc.
Publication of WO2023056559A1 publication Critical patent/WO2023056559A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/006 Mixed reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration

Definitions

  • Figure 1 is a schematic of a system for rendering a virtual object in an input digital image corresponding to a perspective of a tridimensional scene, according to an embodiment.
  • Figure 2 is a schematic of a method for rendering a virtual object in an input digital image corresponding to a perspective of a tridimensional scene, according to an embodiment.
  • Figures 3A and 3B respectively show the interactive segmentation of a person and of a ground in an image, according to an embodiment.
  • Figures 4A, 4B and 4C respectively illustrate first, second and third steps for acquiring a parametric model of a ground plane, according to an embodiment.
  • Figures 5A, 5B, 5C and 5D are schematics illustrating camera calibration and image insertion steps, according to an embodiment.
  • Figures 6A, 6B, 6C and 6D are schematics illustrating camera calibration and image insertion steps, according to another embodiment.
  • Figure 7A shows a table-shaped plane with a cube placed atop the table plane; and Figures 7B and 7C show the table-shaped plane and cube of Figure 7A in which the shadow respectively falls over the edge of the table and behind an object.
  • Figures 8A and 8B illustrate the scaling of a ground plane, according to possible embodiments.
  • Figure 9A illustrates an input digital image;
  • Figures 9B and 9C respectively illustrate an exemplary virtual object inserted in the image in front of and behind an object; and
  • Figure 9D illustrates moving an object to different positions while applying inpainting.
  • Figure 10 is a flow chart illustrating a method for contour selection using a neural network, according to an embodiment.
  • Figure 11 is a flow chart illustrating a method for contour selection using a neural network, according to another embodiment.
  • Figure 12 is a flow chart illustrating a method for camera calibration using a neural network, according to an embodiment.
  • Figure 13 is a flow chart illustrating a method for interactive segmentation using a neural network, according to an embodiment.
  • one of the uses of the 3D plane segment is to place virtual objects realistically into the scene. The plane-segment-assisted framework determines the 3D position of a plane using a minimal set of inputs (for example, 1-3 clicks), thus allowing a user to quickly build a very simple 3D model of the scene (as a planar segment) from an image with a minimum number of clicks.
  • One or more systems described herein may be implemented in computer program(s) executed on processing device(s), each comprising at least one processor, a data storage system (including volatile and/or non-volatile memory and/or storage elements), and optionally at least one input and/or output device.
  • processing devices encompass computers, servers and/or specialized electronic devices which receive, process and/or transmit data.
  • processing devices can include processing means, such as microcontrollers, microprocessors, and/or CPUs, or be implemented on FPGAs.
  • a processing device may be a programmable logic unit, a mainframe computer, a server, a personal computer, a cloud-based program or system, a laptop, a personal data assistant, a cellular telephone, a smartphone, a wearable device, a tablet, a video game console or a portable video game device.
  • Each program is preferably implemented in a high-level programming and/or scripting language, for instance an imperative (e.g., procedural or object-oriented) or a declarative (e.g., functional or logic) language, to communicate with a computer system.
  • a program can be implemented in assembly or machine language if desired.
  • the language may be a compiled or an interpreted language.
  • Each such computer program is preferably stored on a storage media or a device readable by a general or special purpose programmable computer for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein.
  • the system may be embedded within an operating system running on the programmable computer.
  • system, processes and methods of the described embodiments are capable of being distributed in a computer program product comprising a computer readable medium that bears computer-usable instructions for one or more processors.
  • the computer-usable instructions may also be in various forms including compiled and non-compiled code.
  • Storage medium can store instructions, algorithms, rules and/or trading data to be processed.
  • Storage medium encompasses volatile or non-volatile/persistent memory, such as registers, cache, RAM, flash memory, ROM, diskettes, compact disks, tapes, chips, as examples only.
  • the type of memory is, of course, chosen according to the desired use, whether it should retain instructions, or temporarily store, retain or update data. Steps of the proposed method are implemented as software instructions and algorithms, stored in computer memory and executed by processors.
  • the system 1 comprises a user device 40 and a backend device 100.
  • the user device 40 can comprise a processing device adapted to allow a user to select an input digital image 10 and a virtual object 20, and request that virtual object 20 be inserted into digital image 10 at a user-specified location, such that a new composited digital image 30 is generated according to a rendering of the virtual object 20 in the digital image 10.
  • the processing to generate the composited digital image 30 from the input digital image 10 and the virtual object 20 is performed on backend device 100, which can also comprise a processing device.
  • the backend device 100 is a different processing device than user device 40, but it is appreciated that other configurations are possible.
  • the input digital image 10 can correspond to a digital depiction of a scene, such as a digital photograph of a scene.
  • the scene can include a scene layout, such as one or more objects positioned relative to an environment, such as a ground, walls and/or a ceiling of given dimensions.
  • the scene can further include one or more lighting sources illuminating objects in the scene and/or the scene environment.
  • the digital image can depict a given perspective of the scene, for example representing a tridimensional scene as a two-dimensional image from the perspective of a physical or virtual camera used to capture the digital image.
  • the digital image 10 may only contain limited information about the scene.
  • the digital image 10 can depict portions of the scene layout, environment, and lighting within a field of view of the camera used to capture the digital image, while not including portions of the scene outside the camera field of view.
  • the scene being depicted can be a real scene, such as an image of physical objects in a physical environment, a virtual scene, such as an image of virtual objects in a virtual environment, and/or a mix thereof.
  • the virtual object 20 can correspond to a computer-generated object that can be inserted into the scene depicted by the input digital image 10 to produce the new, composited digital image 30.
  • the virtual object 20 can correspond to a cropped portion of a source image that a user wants to insert into the scene depicted by the input digital image 10.
  • the source image can be a different image than the input digital image 10, or the same image as the input digital image such that the cropped object can be re-inserted into the scene.
  • the virtual object 20 can be of a predefined shape/size and have different reflectance properties.
  • the system 1 can include modules for estimating different parameters of the image and the corresponding scene, such that the virtual object 20 can be rendered at a desired position in the scene while taking into account camera calibration parameters, lighting parameters and layout/environment to realistically render the virtual object 20.
  • User device 40 comprises a user input device adapted to allow users to specify coordinates of the input digital image 10, for instance a keyboard, a speech-to-text processor or a pointing device.
  • user device 40 can comprise any processing device having one or more user input devices integrated therein and/or interfaced therewith.
  • user device 40 can be a personal computer, a laptop computer, a tablet or a smartphone.
  • User device 40 can be equipped with an output device, such as a display adapted to show the input digital image 10, and an input device such as a mouse, a trackpad, a touch panel 45 or other pointing device that can be used by a user to specify a set of coordinates of the digital image 10 by clicking and/or touching at corresponding positions of the display showing the digital image 10.
  • the user device 40 is configured to allow a user to select a digital image 10 as well as a virtual object 20 for insertion, along with additional user inputs including at least the specification of an insertion position 145, and to receive a composited digital image that can be displayed, saved and/or shared on a social media platform.
  • the processing required to generate a composited digital image 30 from an input digital image 10, a virtual object 20 and at least an insertion position 145 can be implemented in various modules of a backend device 100.
  • the backend device 100 can correspond to the user device 40 or to a different processing device that user device 40 can be in communication with, for instance through a network link. It can be appreciated that, where the user device 40 and the backend device 100 are different processing devices, not all modules need to be implemented on the same device.
  • certain modules can be implemented on user device 40, other modules can be implemented on backend device 100, and yet other modules can be totally or partially implemented redundantly on both devices 40, 100, such that the functionality of these modules can be obtained on the most advantageous device, with respect, for instance, to the quality and availability of processing power or of the communication link.
  • Backend device 100 can include a calibration module 110, configured to acquire camera parameters 115 from input digital image 10.
  • the calibration process enables the retrieval of information tied to the capture device for image formation.
  • Standard perspective cameras are modelled by the pinhole model, where intrinsic information about the lens system defines how each light ray converges to a focal point to finally touch the image plane.
  • Zhang, Zhengyou “Flexible camera calibration by viewing a plane from unknown orientations”, Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999, the contents of which is hereby incorporated by reference, proposes to retrieve this information using a known planar pattern.
  • such a calibration retrieves the camera parameters, such as the intrinsic parameters and the position of the camera with respect to the target. It is assumed that the camera is not "rolled", such that the camera up vector points toward the sky. This is a standard assumption with typical camera systems, as images are rarely captured with a roll angle. It is also assumed that the camera intrinsic parameters, such as the focal length, are known, and that the view centre coincides with the image centre.
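  • By way of illustration only, the following is a minimal sketch of the pinhole model under these assumptions (principal point at the image centre, no roll), mapping a pixel to a viewing ray; the function names and the numpy-based formulation are assumptions for the example rather than part of the disclosure.

```python
import numpy as np

def intrinsics(focal_px: float, width: int, height: int) -> np.ndarray:
    """Pinhole intrinsic matrix, assuming the view centre coincides with the image centre."""
    return np.array([[focal_px, 0.0, width / 2.0],
                     [0.0, focal_px, height / 2.0],
                     [0.0, 0.0, 1.0]])

def pixel_to_ray(u: float, v: float, K: np.ndarray) -> np.ndarray:
    """Back-project a pixel (u, v) to a unit viewing ray in camera coordinates."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return ray / np.linalg.norm(ray)
```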
  • an artificial neural network can be used to infer camera parameters 115 from a digital image 10, as described for instance in U.S. Patent No. 10,515,460, the contents of which is hereby incorporated by reference.
  • Figure 12 shows a flow chart depicting a method for camera calibration 111 using a neural network according to an example embodiment of calibration module 110.
  • the neural network is trained on a large dataset of panoramic images. During training, an image with a random field of view and camera orientation is extracted from the panorama and provided to the neural network.
  • the neural network contains a convolutional backbone that takes a standard RGB image as input followed by a fully connected neural network that outputs various geometric camera parameters 115 of the image formation process.
  • the parameters include but are not limited to: a 3D vector representing the position of the camera with respect to a plane segment, a camera pan and tilt angle and a set of points defining a polygon on the ground plane.
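  • As a hedged sketch only, such a network could be assembled from an off-the-shelf convolutional backbone and a small fully connected head; the backbone choice (ResNet-18), the four regressed parameters (pan, tilt, vertical field of view, camera height) and all names below are illustrative assumptions, not the disclosed architecture.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CalibrationNet(nn.Module):
    """Convolutional backbone followed by a fully connected head that regresses
    geometric camera parameters (here: pan, tilt, vertical field of view, camera height)."""
    def __init__(self, n_params: int = 4):
        super().__init__()
        backbone = models.resnet18(weights=None)  # any convolutional backbone would do
        backbone.fc = nn.Identity()               # keep the 512-dimensional feature vector
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, n_params),
        )

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        # rgb: (batch, 3, H, W) -> (batch, n_params)
        return self.head(self.backbone(rgb))
```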
  • the network is trained using a large dataset of panoramic images, i.e., images exhibiting a 360-degree field of view.
  • an image with a random field of view and camera orientation is extracted from the panorama and provided to the neural network.
  • heavy data augmentation is used: the camera parameters are randomly sampled within realistic ranges and a random panorama is selected.
  • a crop image representing those parameters is extracted from the panorama using reprojection, and passed through the network for optimization using the gradients computed from the camera parameters used. This process is repeated with a large number of parameters and with the full panorama dataset multiple times during training.
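  • The reprojection used to build each training sample can be sketched as follows; the equirectangular convention, the sampling ranges and the nearest-neighbour lookup are assumptions made for the example and may differ from the actual training pipeline.

```python
import numpy as np

def crop_from_panorama(pano: np.ndarray, pan: float, tilt: float,
                       v_fov: float, out_h: int = 224, out_w: int = 224) -> np.ndarray:
    """Reproject a perspective crop out of an equirectangular panorama.
    pan, tilt and v_fov are in radians; world convention: x right, y up, z forward."""
    H, W = pano.shape[:2]
    f = (out_h / 2.0) / np.tan(v_fov / 2.0)               # focal length in pixels

    # Viewing ray of every output pixel, in camera coordinates.
    u, v = np.meshgrid(np.arange(out_w), np.arange(out_h))
    d = np.stack([(u - out_w / 2.0) / f,
                  -(v - out_h / 2.0) / f,
                  np.ones_like(u, dtype=float)], axis=-1)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)

    # Rotate the rays by the sampled orientation (tilt about x, then pan about y).
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(tilt), -np.sin(tilt)],
                   [0, np.sin(tilt),  np.cos(tilt)]])
    Ry = np.array([[np.cos(pan), 0, np.sin(pan)],
                   [0, 1, 0],
                   [-np.sin(pan), 0, np.cos(pan)]])
    d = d @ (Ry @ Rx).T

    # Spherical coordinates -> panorama pixels (nearest-neighbour lookup).
    lon = np.arctan2(d[..., 0], d[..., 2])
    lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))
    px = ((lon / (2 * np.pi) + 0.5) * (W - 1)).round().astype(int)
    py = ((0.5 - lat / np.pi) * (H - 1)).round().astype(int)
    return pano[py, px]

# During training, pan, tilt and v_fov are drawn at random within realistic ranges,
# the crop is fed to the network, and the sampled values serve as the regression target.
```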
  • at inference, a limited-field-of-view digital image 10 from a standard camera is provided to the network and the parameters 115 are directly estimated.
  • backend device 100 can include a detection and delineation module 120, configured to detect features in digital image 10 and segment the boundaries of these features, thereby obtaining a segmentation 125.
  • the segmentation process can for instance be an instance segmentation process, where only certain features of interest are detected and delineated in segmentation 125.
  • it can be desirable in particular to detect and delineate one or more ground planes, corresponding to one or more surfaces in the scene depicted by a digital image 10, such as the ground or the surface of a table, a shelf, etc., as well as objects that are positioned on the one or more ground planes.
  • An object segmented in this fashion can be represented as a two-dimensional (2D) object that lies on a plane orthogonal relative to one or more ground planes in the tridimensional scene.
  • in this way, inserting virtual objects on a ground plane, in front of or behind an object, and optionally casting a shadow of the virtual object on the ground plane, can be greatly facilitated.
  • An accurate delineation can advantageously be obtained interactively with a user inputting coordinates that are to be included in a plane or object.
  • Figure 3A shows the segmentation of a person 330 in an image.
  • Figure 3B shows the segmentation of the ground 320 in the same image.
  • a plane can be defined using image features and simple user inputs as shown in Figures 4A to 4C. The user can intuitively select at least one point in the image to create a tridimensional (3D) plane that overlays the original image scene.
  • This selects a ground plane for instance by specifying an array of coordinates of the digital image 10 that correspond to the contour of the plane.
  • a virtual object can then be inserted into the ground plane.
  • a ground plane geometry also allows a proper casting of the shadow to be rendered.
  • the ground plane is not necessarily a flat plane but can be a curved, rounded, jagged, textured or flat surface.
  • the user can be required to enter input values to solve for ambiguities. For instance, the user may be required to select a number of image coordinates. In one instance, the user can select 3 image coordinates that can represent the 3 corners of the 3D plane segment, assuming that the 3 corners are coplanar and form right angles. This allows the algorithm to solve for the missing 6 parameters with trigonometry.
  • this method provides a simple interface for a user to create 3D plane segments using at least one swipe (effectively selecting three 2D coordinates).
  • Such input can be useful for mobile-based user interaction: selecting more than 4 points is less user-friendly and selecting vanishing points that lie outside of the image is a limitation for devices with limited screen size.
  • the geometry can be used for virtual object composition by using the plane itself to orient the object and the plane geometry to cast realistic shadows.
  • a first click 491, a first swipe to 492 and a second swipe to 493 are needed to determine the ground plane 420 position.
  • Figure 13 shows a flow chart depicting an alternative implementation of a detection and delineation module 121 to obtain a segmented fragment 126 of an input digital image 10 interactively with user-provided coordinates, or "clicks," via a neural network.
  • segmentation 126 will most often correspond to a segmentation mask, e.g., a matrix having a size equal to the resolution of digital image 10, where each element of the matrix corresponds to a pixel of the image and indicates whether it is part of a detected and delineated feature.
  • Similar systems are known in the art and described for instance in U.S. Patent Application Publication No. 2008/0136820 A1, the contents of which is hereby incorporated by reference.
  • the interactive segmentation neural network allows a user to select components in a scene with a single click.
  • the neural network defines a segmentation mask based on the semantics of the image and the click position. From that segmentation, the segmentation contour pixels can be used to define how the tridimensional plane will be cut out.
  • a detection and delineation module 121 can also take as input two sets of coordinates, one being defined as “positive” coordinates, containing coordinates that are specified as being a part of the feature to be delineated, the other being defined as “negative” coordinates, containing coordinates that are specified as not being part of the feature to be delineated.
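  • As an illustrative sketch only (not necessarily the disclosed encoding), positive and negative clicks are commonly fed to such a network as extra input channels, for example as Gaussian maps concatenated to the RGB image:

```python
import numpy as np

def click_maps(shape, positive, negative, sigma=10.0):
    """Encode user clicks as two extra channels: one map for 'positive' coordinates
    (inside the feature to delineate) and one for 'negative' coordinates (outside).
    Each click is splatted as a Gaussian blob."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]

    def splat(clicks):
        m = np.zeros((h, w), dtype=np.float32)
        for (x, y) in clicks:
            m = np.maximum(m, np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2)))
        return m

    return splat(positive), splat(negative)

# The segmentation input is then the RGB image stacked with the two click maps:
# net_input = np.concatenate([rgb, pos_map[..., None], neg_map[..., None]], axis=-1)
```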
  • a detection and delineation module 120, 121 can be configured to produce as output an alpha matte, as shown for instance in U.S. Patent No. 11,004,208, the contents of which is hereby incorporated by reference.
  • the alpha matte is a type of segmentation mask where each element of the matrix indicates whether the corresponding pixel of the digital image 10 is part of the feature and additionally indicates the opacity of the feature at that pixel with respect to the background.
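  • For reference, compositing with an alpha matte follows the standard "over" blend; the short sketch below is a generic formulation rather than the module's actual implementation:

```python
import numpy as np

def alpha_composite(foreground: np.ndarray, background: np.ndarray,
                    alpha: np.ndarray) -> np.ndarray:
    """Blend a rendered foreground into a background image using an alpha matte in [0, 1]."""
    a = alpha[..., None] if alpha.ndim == 2 else alpha
    return a * foreground + (1.0 - a) * background
```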
  • the backend device 100 can include a projection module 130.
  • projection module 130 can construct a simple tridimensional representation of the scene comprising tridimensional shapes 135.
  • Each of the ground planes delineated by the detection and delineation module 120 is back-projected as a corresponding tridimensional polygon using the camera parameters 115 and simple trigonometric functions.
  • Each of the objects delineated by the detection and delineation module 120 is similarly back-projected as a corresponding tridimensional plane.
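  • A minimal sketch of such a back-projection, assuming the pinhole intrinsics K from the earlier sketch, a camera at height cam_height above the ground, a tilt about its x-axis and no roll (all helper names are illustrative):

```python
import numpy as np

def backproject_contour_to_ground(contour_px, K, tilt, cam_height):
    """Back-project a 2D ground-plane contour (pixel coordinates) to a 3D polygon
    on the plane y = 0 by intersecting each viewing ray with the ground."""
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(tilt), -np.sin(tilt)],
                   [0, np.sin(tilt),  np.cos(tilt)]])
    cam_pos = np.array([0.0, cam_height, 0.0])
    K_inv = np.linalg.inv(K)
    polygon = []
    for (u, v) in contour_px:
        d = K_inv @ np.array([u, v, 1.0])
        d = Rx @ np.array([d[0], -d[1], 1.0])   # flip y: image rows grow downward
        if d[1] >= -1e-6:                        # ray never reaches the ground plane
            continue
        t = -cam_pos[1] / d[1]
        polygon.append(cam_pos + t * d)
    return np.array(polygon)
```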
  • the backend device 100 can include an insertion module 140, configured to obtain, for instance from a user, an insertion position 145.
  • the insertion module 140 can cause the 3D shapes 135 to be displayed on the user device 40, for instance by superimposing the edges of the 3D shapes 135 over the digital image 10, facilitating the task of using an input device to position the virtual object 20 orthogonal to the tridimensional polygon corresponding to the desired ground plane, at the desired depth, e.g., in front of or behind the tridimensional plane corresponding to the desired object.
  • the backend device 100 can include a scaling module 150, configured to obtain, for instance from a user, a scaling 155, such that a virtual object of the appropriate size can be shown in the composited digital image 30.
  • it is possible that the size of the virtual object 20 is known but the scale of the digital image 10 is unknown.
  • the scaling module 150 can cause a scaling grid to be displayed superimposed onto, e.g., a ground plane of the digital image 10 on user device 40, using the camera parameters 115, and allow the user to resize the squares of the scaling grid using an input device, e.g., a scroll wheel of a mouse, such that each square of the grid corresponds to an area of a predetermined size, e.g., 10 cm by 10 cm, from which a scaling is determined.
  • Figures 8A and 8B provide simple visual feedback for the estimation of the ground plane scale. In some embodiments, it is possible that the size of the virtual object 20 is unknown.
  • the scaling module 150 can cause the virtual object to be displayed superimposed onto the digital image 10 in user device 40, and allow the user to resize the virtual object 20 directly using an input device, e.g., a scroll wheel of a mouse.
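  • One possible way to turn the grid interaction into a scaling 155, assuming each grid square is declared to cover 10 cm; the helper names and the units are illustrative assumptions:

```python
def metres_per_scene_unit(grid_square_scene_units: float,
                          grid_square_real_m: float = 0.10) -> float:
    """Scale implied by the user-adjusted grid: each square covers grid_square_real_m
    (e.g. 10 cm by 10 cm) in the real scene."""
    return grid_square_real_m / grid_square_scene_units

def object_size_in_scene_units(object_size_m: float, scale_m_per_unit: float) -> float:
    """Size to give a virtual object of known real size so it renders at the right scale."""
    return object_size_m / scale_m_per_unit
```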
  • the backend device 100 can include a light parameters module 160, configured to obtain lighting parameters 165, including for instance the position, the direction, the angular size and the colour of one or more light sources in digital image 10.
  • the light parameters module 160 is configured to infer the lighting parameters 165, for instance using an artificial neural network trained for this task, as described, e.g., in International Patent Application Publication No. WO 2021/042208 or in U.S.
  • the light parameters module 160 is configured to allow a user to edit inferred lighting parameters 165 and/or to allow a user to specify arbitrary lighting parameters 165 through the user device 40.
  • the backend device 100 can include a rendering module 170, configured to generate a render 175 of the virtual object 20 ready for compositing in the input digital image 10.
  • the rendering module 170 takes as input the virtual object 20, the segmentation 125 and the insertion position 145 and creates a render 175 ready to be composited into digital image 10 by the compositing module 180, creating the composited digital image 30.
  • rendering module 170 uses the segmentation 125 of the detected object to cause an occlusion of virtual object 20, e.g., by cropping a portion of the virtual object 20 that is not to be visible because it is placed behind the detected object.
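  • A sketch of this occlusion, with the detected object's segmentation mask simply removed from the render's alpha before compositing (illustrative only):

```python
import numpy as np

def occlude_render(render_rgb: np.ndarray, render_alpha: np.ndarray,
                   object_mask: np.ndarray, behind: bool):
    """If the virtual object is inserted behind the detected object, zero out the
    render's alpha wherever the detected object covers it, so those pixels keep
    the original image content during compositing."""
    alpha = render_alpha.astype(np.float32)
    if behind:
        alpha = alpha * (1.0 - object_mask.astype(np.float32))
    return render_rgb, alpha

# The occluded render is then blended into the input image, e.g. with the
# alpha_composite() sketch given earlier.
```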
  • the rendering module 170 can additionally take as input a scaling 155.
  • if a scaling 155 is available, the virtual object 20 can be scaled according to the scaling 155 before being rendered.
  • the rendering module 170 can additionally take as input lighting parameters 165.
  • if lighting parameters 165 are available, one or more shadows of the virtual object 20 can be rendered so as to appear to be cast on the ground plane orthogonally to which the virtual object 20 is inserted.
  • rendering module 170 uses the segmentation 125 of the detected object to cause an occlusion of the shadow, e.g., by cropping a portion of the shadow that is not to be visible because it is placed behind the detected object.
  • rendering module 170 uses the segmentation 125 of the ground plane to cause an occlusion of the shadow, e.g., by cropping a portion of the shadow that is not to be visible because it would be cast outside of the ground plane.
  • in Figure 7A, a shadow 715 is cast onto the surface of the table 720. Because neither the cube 710 nor the shadow 715 are to be rendered at a position partially behind the plant 730 or outside the table 720, no occlusion is applied.
  • in Figure 7B, a shadow 715 is cast onto the surface of the table 720. Because the shadow 715 falls over the edge and would therefore be rendered at a position partially outside the table 720, occlusion of the part of the shadow that would be rendered outside of the table 720 is applied.
  • in Figure 7C, a shadow 715 is cast onto the surface of the table 720. Because both the cube 710 and the shadow 715 are partially behind the plant 730, occlusion of the part of the cube and of the shadow that would be behind the plant 730 is applied.
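  • The shadow geometry of Figures 7A to 7C can be sketched as a simple projection of the object's 3D points onto the ground plane along a directional light; the directional-light assumption and the function names are illustrative:

```python
import numpy as np

def project_shadow_points(points_3d: np.ndarray, light_dir: np.ndarray) -> np.ndarray:
    """Project the virtual object's 3D points onto the ground plane y = 0 along a
    directional light that points downward (negative y component)."""
    d = light_dir / np.linalg.norm(light_dir)
    t = -points_3d[:, 1] / d[1]            # distance along the light ray to reach y = 0
    return points_3d + t[:, None] * d

# The projected footprint can be rasterised into a shadow mask, then clipped by the
# ground-plane polygon (shadow falling off the table, Figure 7B) and by the detected
# object's mask (shadow falling behind the plant, Figure 7C).
```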
  • Figure 9A shows a digital image comprising a ground plane 920 and a detected object 930.
  • Figure 9B shows the same digital image with a virtual object 910 inserted in front of the detected object 930.
  • Figure 9C shows the same digital image with the virtual object 910 inserted behind the detected object 930.
  • because detected object 930 is itself detected and delineated by the segmentation module 120, it can be cropped off the digital image and used as an additional virtual object 980 to be inserted in the digital image. Cropping detected object 930 off the digital image will leave a "hole" 985 at the original position of object 930.
  • the backend device 100 can therefore include an inpainting module, configured to fill “hole” 985 with an inferred texture after object 930 is cropped off.
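  • As an illustrative stand-in only (the inpainting module may instead rely on a learned inpainting network, as discussed further below), a classical diffusion-based inpainting can fill the hole left by the cropped object:

```python
import cv2
import numpy as np

def fill_hole(image_bgr: np.ndarray, hole_mask: np.ndarray) -> np.ndarray:
    """Fill the region left behind by the cropped object with an inferred texture."""
    mask = (hole_mask > 0).astype(np.uint8) * 255   # 8-bit mask, non-zero = pixels to fill
    return cv2.inpaint(image_bgr, mask, 3, cv2.INPAINT_TELEA)
```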
  • method 200 provides a workflow adapted to generate a composited digital image from an input digital image and a virtual object, including steps for calibration 210, detection and delineation 220, projection 230, insertion and rendering 240, and compositing 250. It can be appreciated that steps of method 200 can be performed sequentially or in parallel in any order, provided that all the inputs required by a given step have been produced before starting that step.
  • the steps of calibration 210 and detection and delineation 220 can be performed in any order or in parallel, but both must be finished before the step of projection 230 is started, because projection 230 depends on camera parameters obtained during the step of calibration 210 and on segments obtained during the step of detection and delineation 220.
  • a first step of method 200 can be the step of camera calibration 210, which can for instance be performed by a calibration module implementing a convolutional neural network trained to infer camera parameters such as the 3D camera orientation (elevation, tilt, camera height), the camera field of view and the camera aperture from the input digital image.
  • a second step of method 200 can be the step of detection and delineation 220, which can for instance be performed by a detection and delineation module implementing a convolutional neural network trained to infer, from the input digital image, a segmentation mask or segmentation contours corresponding to the two-dimensional shape of one or more ground planes and of one or more detected objects orthogonal to the one or more ground planes.
  • the segmentation mask or segmentation contours can be inferred by the neural network with the help of coordinates provided by a user via a user device.
  • a subsequent step can include cropping 222 one of the objects from the digital image for use as a virtual object to be reinserted at a different position in the scene represented by the digital image. Performing the cropping 222 step will result in a “hole” appearing in the digital image at the position from which the detected object was cropped.
  • a subsequent step can therefore include inpainting 224, which can for instance be performed by an inpainting module implementing a fast Fourier convolutions neural network trained to fill a region of the digital image with an appropriate texture.
  • a next step of method 200 can be the step of projection 230, which can be performed by simple trigonometric calculations adapted to back-projecting two-dimensional ground planes as tridimensional polygons and two-dimensional orthogonal objects as tridimensional orthogonal planes using their segmentation and the camera parameters.
  • a next step of method 200 can be the step of insertion and rendering 240, during which a user specifies the insertion position of the virtual object, for instance through an insertion module, by manipulating the virtual object in the simplified tridimensional scene representation generated from the tridimensional polygons and planes, and the virtual object is rendered at the specified insertion position, for instance by a rendering module.
  • a step of scaling 242 the virtual object can be performed, for instance by a scaling module.
  • for example, if the dimensions of the virtual object are known, a user can provide a scale for the image, so that during rendering the virtual object can be scaled appropriately. As another example, even if the dimensions of the virtual object are not known, a user can directly manipulate and scale the virtual object before rendering.
  • a step of shadow casting 244 can be performed, for instance by the rendering module, using simple trigonometric calculations on the position and dimension of the virtual objects and the lighting parameters, e.g., the position and direction of a light source.
  • an occlusion step 246 can be performed, for instance by the rendering module, e.g., by cropping off the portions of the virtual objects that would be behind the detected object according to the virtual object insertion position and the segmentation of the detected object.
  • the occlusion step 246 can be performed, for instance by the rendering module, e.g., by cropping off the portions of the shadow that would be behind the detected object and/or outside the ground plane according to the virtual object insertion position and the segmentation of the detected object and/or the ground plane.
  • a final step of method 200 can be the step of compositing 250, during which a new, composited digital image is created by inserting the rendered virtual object at the specified position in the input digital image.
  • Figures 5A to 5D are diagrams illustrating an example application of steps 210 to 240.
  • Figure 5A shows a tridimensional model of a virtual object 510, either estimated from an image from which the object was, e.g., cropped, or captured with specialized hardware, to be inserted in a standard two-dimensional RGB image.
  • Figure 5B shows a standard two-dimensional RGB image, where one ground plane 520 and two detected objects 530a, 530b are segmented during the detection and delineation 220 step. From the image, the camera calibration, such as the field of view and the camera 3D position, is estimated at the calibration 210 step. From the extracted information, a 3D scene is composed from the ground segment and the object segments.
  • Figure 5C shows the virtual object 510 being inserted behind a segmented object 530a and light interaction (shadow 515) being inserted on the ground.
  • Figure 5D represents a different view of the 3D scene, with the camera position 550 inferred during calibration 210 being used to obtain a tridimensional projection of the ground plane 521 and of the detected objects 531a, 531b, and a virtual light 560 either specified or inferred.
  • Figures 6A to 6D show diagrams of another example application of steps 210 to 240.
  • Figure 6A shows a ground plane segment 620 and detected objects 630a, 630b in a two-dimensional image over a background 640.
  • Figure 6B shows the camera parameters, including the position of the camera 650, having been extracted during calibration 210, the back-projected 3D polygon 621 corresponding to the ground plane, and the back-projected 3D planes corresponding to the detected objects 631a, 631b.
  • Figure 6C shows a virtual object 610 being placed into the 3D scene, behind the back-projected object 631a.
  • Figure 6D shows the result of the compositing 250 step, the virtual object 610 having been inserted behind object 630a with the appropriate occlusion, and its shadow being cast on ground plane 620, also with the appropriate occlusion with respect to object 630a, over the background 640.
  • Figure 10 presents a flow chart depicting an embodiment of a system 2 including a contour selection algorithm via a neural network implemented in a backend device 100.
  • Figure 10 provides a functional diagram for the geometry creation process. From a standard RGB image as input 10, camera parameters 115 are extracted as well as an outline/segmentation of the image through a parametric model.
  • the parametric model can be a circle or a square, or a different segmentation method can be used, as explained above or shown in Figure 13.
  • This module gives a set of image coordinates that describe the plane contour in the image.
  • this outline can be back-projected in 3D to a ground plane, providing a 3D polygon 136 with respect to the camera that represents the plane segment that the user selects.
  • Figure 11 presents a flow chart depicting an embodiment of a system 3 including a contour selection algorithm via a neural network implemented in a backend device 100.
  • Figure 11 also provides a functional diagram for the geometry creation process.
  • This embodiment may further comprise a lighting extraction module. Lighting parameters can be inferred from the input image 10. This enables the use of a light simulation system to cast shadows/reflections 176 on the 3D polygon for the final rendering.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

A system and a method for compositing a virtual object in an input digital image acquired with a camera are disclosed. Before compositing, a set of camera calibration parameters corresponding to the camera is extracted; ground planes and objects orthogonal to a ground plane are detected and delineated; a tridimensional scene is inferred by back-projecting the ground planes to tridimensional polygons and the objects to tridimensional planes; and the virtual object is inserted and rendered at a specified location in the tridimensional scene orthogonally to a ground plane. Steps can be implemented using artificial neural networks. Lighting parameters can additionally be inferred or provided to cast a shadow for the virtual object on the ground plane. When the virtual object is cut out of the input digital image, inpainting can be implemented.

Description

SYSTEMS AND METHODS FOR COMPOSITING A VIRTUAL OBJECT IN A DIGITAL IMAGE
[0000] This application claims the benefit of and priority to United States Provisional Patent Application No. 63/252,866, filed October 6, 2021, the contents of which is hereby incorporated by reference.
TECHNICAL FIELD
[0001] The following relates to the field of image processing and more specifically to the field of image editing.
BACKGROUND
[0002] Existing systems and methods for compositing virtual objects in two-dimensional digital images representing tridimensional scenes usually involve creating a tridimensional scene model in which objects can be rendered. However, creating such a scene model is a complicated task, and the resulting models can contain numerous inaccuracies that limit the realistic quality of the generated images. There is therefore room for improvement.
SUMMARY
[0003] According to an aspect, there is provided a method for compositing a virtual object in an input digital image acquired with a camera, the method including: extracting a set of camera calibration parameters corresponding to the camera; detecting and delineating at least one ground plane; detecting and delineating at least one object orthogonal to one of the at least one ground plane; inferring a tridimensional scene by back-projecting each of the at least one ground plane to at least one corresponding tridimensional polygon and each of the at least one object to at least one corresponding tridimensional plane; inserting and rendering the virtual object at a specified location in the tridimensional scene orthogonally to one of the at least one ground plane; and compositing the rendered virtual object in the input digital image.
[0004] In some embodiments, the camera calibration parameters are estimated using a camera calibration neural network trained to map a calibration input including the input digital image to the set of camera calibration parameters.
[0005] In some embodiments, training the camera calibration neural network includes: receiving a set of panoramic images; creating a plurality of sample images, wherein each of the plurality of sample images is a reprojection of a portion of one random panoramic image from the set of panoramic images using random camera parameters; and optimizing the camera calibration neural network with each of the plurality of sample images and the corresponding random camera parameters.
[0006] In some embodiments, the at least one ground plane is detected and delineated using a segmentation neural network trained to map a segmentation input including the input digital image to an output including the delineation of the at least one ground plane.
[0007] In some embodiments, the segmentation input further includes coordinates of at least one activation of a pointing device.
[0008] In some embodiments, the coordinates include positive coordinates and negative coordinates.
[0009] In some embodiments, the detected at least one object is detected and delineated using the segmentation neural network, wherein the output further includes the delineation of the detected at least one object.
[0010] In some embodiments, the delineation of the at least one ground plane and the delineation of the detected at least one object are two-dimensional delineations.
[0011] In some embodiments, at least one of the delineation of the at least one ground plane and the delineation of the detected at least one object is an array of coordinates corresponding to a contour.
[0012] In some embodiments, at least one of the delineation of the at least one ground plane and the delineation of the detected at least one object is a segmentation mask.
[0013] In some embodiments, the segmentation mask is an alpha matte.
[0014] In some embodiments, rendering the virtual object in the tridimensional scene includes scaling the virtual object.
[0015] In some embodiments, in response to the virtual object being inserted at least partially behind at least one of the detected at least one object in the tridimensional scene, rendering the virtual object includes occluding the virtual object with respect to the delineation of the detected at least one object.
[0016] In some embodiments, the method further includes the step of estimating lighting parameters, and wherein rendering the virtual object includes casting a shadow on the corresponding one of the at least one ground plane with respect to the estimated lighting parameters.
[0017] In some embodiments, the method further includes the step of defining arbitrary lighting parameters, and wherein rendering the virtual object includes casting a shadow on the corresponding one of the at least one ground plane with respect to the arbitrary lighting parameters.
[0018] In some embodiments, in response to the shadow being cast at least partially behind at least one of the detected at least one object in the tridimensional scene, casting the shadow includes occluding the shadow with respect to the delineation of the detected at least one object.
[0019] In some embodiments, the virtual object is cropped from a second digital image.
[0020] In some embodiments, the virtual object is cropped from the input digital image.
[0021] In some embodiments, the method further includes inpainting an area of the input digital image corresponding to cropped pixels of the virtual object.
[0022] According to another aspect, there is provided a system for compositing a virtual object in an input digital image acquired with a camera, the system including: a user input device; a calibration parameter extraction module configured to extract a set of camera calibration parameters corresponding to the camera; a detection and delineation module configured to: detect and delineate at least one ground plane, and detect and delineate at least one object orthogonal to the ground plane; a back-projection module configured to infer a tridimensional scene by back-projecting each of the at least one ground plane to at least one corresponding tridimensional polygon and each of the at least one object to at least one corresponding tridimensional plane; an insertion module configured to allow for the insertion of the virtual object at a specified location in the tridimensional scene by the user input device; a rendering module configured to render the virtual object at the specified location in the tridimensional scene orthogonally to a corresponding one of the at least one ground plane; and a compositing module configured to composite the rendered virtual object in the input digital image.
[0023] In some embodiments, the user input device is a pointing device.
[0024] In some embodiments, the calibration parameter extraction module includes a camera calibration neural network trained to map a calibration input including the input digital image to the set of camera calibration parameters.
[0025] In some embodiments, the system further includes a camera calibration neural network training module configured to: create a plurality of sample images, wherein each of the plurality of sample images is a reprojection of a portion of one random panoramic image from a set of panoramic images using random camera parameters; and optimize the camera calibration neural network with each of the plurality of sample images and the corresponding random camera parameters.
[0026] In some embodiments, the detection and delineation module includes a segmentation neural network trained to map a segmentation input including the input digital image to an output including the delineation of the at least one ground plane and of the detected at least one object.
[0027] In some embodiments, the segmentation input further includes coordinates obtained from the user input device.
[0028] In some embodiments, the coordinates include positive coordinates and negative coordinates.
[0029] In some embodiments, the delineation of the at least one ground plane and the delineation of the detected at least one object are two-dimensional delineations.
[0030] In some embodiments, at least one of the delineation of the at least one ground plane and the delineation of the detected at least one object is an array of coordinates corresponding to a contour.
[0031] In some embodiments, at least one of the delineation of the at least one ground plane and the delineation of the detected at least one object is a segmentation mask.
[0032] In some embodiments, the segmentation mask is an alpha matte.
[0033] In some embodiments, the system further includes a scaling module configured to define a scale of the input image, wherein the rendering module is further configured to scale the virtual object with respect to the scale of the input image.
[0034] In some embodiments, the rendering module is further configured to scale the virtual object with respect to an arbitrary scale.
[0035] In some embodiments, the insertion module is further configured to detect that the virtual object is being inserted at least partially behind at least one of the detected at least one object in the tridimensional scene, and wherein the rendering module is further configured, in response to the virtual object being inserted at least partially behind at least one of the detected at least one object in the tridimensional scene, to occlude the virtual object with respect to the delineation of the detected at least one object.
[0036] In some embodiments, the system further includes a lighting parameter estimation module configured to estimate lighting parameters, wherein the rendering module is further configured to cast a shadow of the virtual object on the ground plane with respect to the estimated lighting parameters.
[0037] In some embodiments, the rendering module is further configured to cast a shadow of the virtual object on the ground plane with respect to arbitrary lighting parameters.
[0038] In some embodiments, the rendering module is further configured to detect that the shadow is being cast at least partially behind at least one of the detected at least one object in the tridimensional scene and in response to the shadow being cast at least partially behind at least one of the detected at least one object in the tridimensional scene to occlude the shadow with respect to the delineation of the detected at least one object.
[0039] In some embodiments, the system further includes a cropping module configured to acquire the virtual object by cropping a source digital image.
[0040] In some embodiments, the system further includes an inpainting module configured, in response to the source digital image being the input digital image, to inpaint an area of the input digital image corresponding to cropped pixels of the virtual object.
[0041] According to a further aspect, there is provided a non-transitory computer readable medium having recorded thereon statements and instructions for compositing a virtual object in an input digital image acquired with a camera, said statements and instructions when executed by at least one processor causing the at least one processor to: extract a set of camera calibration parameters corresponding to the camera; detect and delineate at least one ground plane; detect and delineate at least one object orthogonal to one of the at least one ground plane; infer a tridimensional scene by back-projecting each of the at least one ground plane to at least one corresponding tridimensional polygon and each of the at least one object to at least one corresponding tridimensional plane; insert and render the virtual object at a specified location in the tridimensional scene orthogonally to one of the at least one ground plane; and composite the rendered virtual object in the input digital image.
[0042] In some embodiments, the camera calibration parameters are estimated using a camera calibration neural network trained to map a calibration input including the input digital image to the set of camera calibration parameters.
[0043] In some embodiments, training the camera calibration neural network includes: receiving a set of panoramic images; creating a plurality of sample images, wherein each of the plurality of sample images is a reprojection of a portion of one random panoramic image from the set of panoramic images using random camera parameters; and optimizing the camera calibration neural network with each of the plurality of sample images and the corresponding random camera parameters.
[0044] In some embodiments, the at least one ground plane is detected and delineated using a segmentation neural network trained to map a segmentation input including the input digital image to an output including the delineation of the at least one ground plane.
[0045] In some embodiments, the segmentation input further includes coordinates of at least one activation of a pointing device.
[0046] In some embodiments, the coordinates include positive coordinates and negative coordinates.

[0047] In some embodiments, the detected at least one object is detected and delineated using the segmentation neural network, wherein the output further includes the delineation of the detected at least one object.
[0048] In some embodiments, the delineation of the at least one ground plane and the delineation of the detected at least one object are two-dimensional delineations.
[0049] In some embodiments, at least one of the delineation of the at least one ground plane and the delineation of the detected at least one object is an array of coordinates corresponding to a contour.
[0050] In some embodiments, at least one of the delineation of the at least one ground plane and the delineation of the detected at least one object is a segmentation mask.
[0051] In some embodiments, the segmentation mask is an alpha matte.
[0052] In some embodiments, rendering the virtual object in the tridimensional scene includes scaling the virtual object.
[0053] In some embodiments, in response to the virtual object being inserted at least partially behind at least one of the detected at least one object in the tridimensional scene, rendering the virtual object includes occluding the virtual object with respect to the delineation of the detected at least one object.
[0054] In some embodiments, the statements and instructions further cause the at least one processor to estimate lighting parameters, and wherein rendering the virtual object includes casting a shadow on the corresponding one of the at least one ground plane with respect to the estimated lighting parameters.
[0055] In some embodiments, the statements and instructions further cause the at least one processor to define arbitrary lighting parameters, and wherein rendering the virtual object includes casting a shadow on the corresponding one of the at least one ground plane with respect to the arbitrary lighting parameters.

[0056] In some embodiments, in response to the shadow being cast at least partially behind at least one of the detected at least one object in the tridimensional scene, casting the shadow includes occluding the shadow with respect to the delineation of the detected at least one object.
[0057] In some embodiments, the virtual object is cropped from a second digital image.
[0058] In some embodiments, the virtual object is cropped from the input digital image.
[0059] In some embodiments, the statements and instructions further cause the at least one processor to inpaint an area of the input digital image corresponding to cropped pixels of the virtual object.
BRIEF DESCRIPTION OF THE DRAWINGS
[0060] For a better understanding of the embodiments described herein and to show more clearly how they may be carried into effect, reference will now be made, by way of example only, to the accompanying drawings which show at least one exemplary embodiment.
[0061] Figure 1 is a schematic of a system for rendering a virtual object in an input digital image corresponding to a perspective of a tridimensional scene, according to an embodiment.
[0062] Figure 2 is a schematic of a method for rendering a virtual object in an input digital image corresponding to a perspective of a tridimensional scene, according to an embodiment.
[0063] Figures 3A and 3B respectively show the interactive segmentation of a person and of a ground in an image, according to an embodiment.
[0064] Figures 4A, 4B and 4C respectively illustrate first, second and third steps for acquiring a parametric model of a ground plane, according to an embodiment.

[0065] Figures 5A, 5B, 5C and 5D are schematics illustrating camera calibration and image insertion steps, according to an embodiment.
[0066] Figures 6A, 6B, 6C and 6D are schematics illustrating camera calibration and image insertion steps, according to another embodiment.
[0067] Figure 7A shows a table-shaped plane with a cube placed atop the table plane; and Figures 7B and 7C show the table-shaped plane and cube of Figure 7A in which the shadow respectively falls over the edge of the table and behind an object.
[0068] Figures 8A and 8B illustrate the scaling of a ground plane, according to possible embodiments.
[0069] Figure 9A illustrates an input digital image; Figures 9B and 9C respectively illustrate an exemplary virtual object inserted in the image in front of and behind an object; and Figure 9D illustrates moving an object to different positions while applying inpainting.
[0070] Figure 10 is a flow chart illustrating a method for contour selection using a neural network, according to an embodiment.
[0071] Figure 11 is a flow chart illustrating a method for contour selection using a neural network, according to another embodiment.
[0072] Figure 12 is a flow chart illustrating a method for camera calibration using a neural network, according to an embodiment.
[0073] Figure 13 is a flow chart illustrating a method for interactive segmentation using a neural network, according to an embodiment.
DETAILED DESCRIPTION
[0074] It will be appreciated that, for simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements or steps. In addition, numerous specific details are set forth in order to provide a thorough understanding of the exemplary embodiments described herein. However, it will be understood by those of ordinary skill in the art that the embodiments described herein may be practised without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the embodiments described herein. Furthermore, this description is not to be considered as limiting the scope of the embodiments described herein in any way but rather as merely describing the implementation of the various embodiments described herein.
[0075] With the advent of social media and constant access to images, it is becoming increasingly important to develop methods of image editing that make the image edits as realistic as possible. For instance, realistically inserting virtual content in an already existing image requires geometric knowledge of the originally captured image, which can be challenging to retrieve. For example, inserting 3D virtual content such as an object in a standard RGB image presents various challenges. In order to render the object realistically, it can be important to know the scene geometry, both to position the object correctly and to render realistic shadows and reflections cast by the object into the scene. A method of creating a 3-dimensional plane segment is taught herein. It can be appreciated that one of the uses of the 3D plane segment is to place virtual objects realistically into the scene: the plane segment assisted framework determines the 3D position of a plane using a minimal set of inputs (for example, 1-3 clicks), thus allowing a user to quickly build a very simple 3D model of the scene (as a planar segment) from an image with a minimum number of clicks.
[0076] One or more systems described herein may be implemented in computer program(s) executed on processing device(s), each comprising at least one processor, a data storage system (including volatile and/or non-volatile memory and/or storage elements), and optionally at least one input and/or output device. “Processing devices” encompass computers, servers and/or specialized electronic devices which receive, process and/or transmit data. As an example, “processing devices” can include processing means, such as microcontrollers, microprocessors, and/or CPUs, or be implemented on FPGAs. For example, and without limitation, a processing device may be a programmable logic unit, a mainframe computer, a server, a personal computer, a cloud-based program or system, a laptop, a personal data assistant, a cellular telephone, a smartphone, a wearable device, a tablet, a video game console or a portable video game device.
[0077] Each program is preferably implemented in a high-level programming and/or scripting language, for instance an imperative (e.g., procedural or object-oriented) or a declarative (e.g., functional or logic) language, to communicate with a computer system. However, a program can be implemented in assembly or machine language if desired. In any case, the language may be a compiled or an interpreted language. Each such computer program is preferably stored on a storage medium or a device readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer to perform the procedures described herein. In some embodiments, the system may be embedded within an operating system running on the programmable computer.
[0078] Furthermore, the system, processes and methods of the described embodiments are capable of being distributed in a computer program product comprising a computer readable medium that bears computer-usable instructions for one or more processors. The computer-usable instructions may also be in various forms including compiled and non-compiled code.
[0079] The processor(s) are used in combination with storage medium, also referred to as “memory” or “storage means”. Storage medium can store instructions, algorithms, rules and/or data to be processed. Storage medium encompasses volatile or non-volatile/persistent memory, such as registers, cache, RAM, flash memory, ROM, diskettes, compact disks, tapes, chips, as examples only. The type of memory is, of course, chosen according to the desired use, whether it should retain instructions, or temporarily store, retain or update data. Steps of the proposed method are implemented as software instructions and algorithms, stored in computer memory and executed by processors.
[0080] With reference to Figure 1, an exemplary system 1 configured for compositing a virtual object in an input digital image acquired with a camera is shown. In the illustrated embodiment, the system 1 comprises a user device 40 and a backend device 100. The user device 40 can comprise a processing device adapted to allow a user to select an input digital image 10 and a virtual object 20, and request that virtual object 20 be inserted into digital image 10 at a user-specified location, such that a new composited digital image 30 is generated according to a rendering of the virtual object 20 in the digital image 10. In the present embodiment, the processing to generate the composited digital image 30 from the input digital image 10 and the virtual object 20 is performed on backend device 100, which can also comprise a processing device. In the illustrated embodiment, the backend device 100 is a different processing device than user device 40, but it is appreciated that other configurations are possible.
[0081] The input digital image 10 can correspond to a digital depiction of a scene, such as a digital photograph of a scene. The scene can include a scene layout, such as one or more objects positioned relative to an environment, such as a ground, walls and/or a ceiling of given dimensions. The scene can further include one or more lighting sources illuminating objects in the scene and/or the scene environment. The digital image can depict a given perspective of the scene, for example representing a tridimensional scene as a two-dimensional image from the perspective of a physical or virtual camera used to capture the digital image. As can be appreciated, the digital image 10 may only contain limited information about the scene. For example, the digital image 10 can depict portions of the scene layout, environment, and lighting within a field of view of the camera used to capture the digital image, while not including portions of the scene outside the camera field of view. As can be appreciated, the scene being depicted can be a real scene, such as an image of physical objects in a physical environment, a virtual scene, such as an image of virtual objects in a virtual environment, and/or a mix thereof.
[0082] The virtual object 20 can correspond to a computer-generated object that can be inserted into the scene depicted by the input digital image 10 to produce the new, composited digital image 30. In some embodiments, the virtual object 20 can correspond to a cropped portion of a source image that a user wants to insert into the scene depicted by the input digital image 10. The source image can be a different image than the input digital image 10, or the same image as the input digital image such that the cropped object can be re-inserted into the scene. The virtual object 20 can be of a predefined shape/size and have different reflectance properties. As will be described in more detail hereinafter, the system 1 can include modules for estimating different parameters of the image and the corresponding scene, such that the virtual object 20 can be rendered at a desired position in the scene while taking into account camera calibration parameters, lighting parameters and layout/environment to realistically render the virtual object 20.
[0083] User device 40 comprises a user input device adapted to allow users to specify coordinates of the input digital image 10, for instance a keyboard, a speech-to-text processor or a pointing device. As can be appreciated, user device 40 can comprise any processing device having one or more user input devices integrated therein and/or interfaced therewith. For example, user device 40 can be a personal computer, a laptop computer, a tablet or a smartphone. User device 40 can be equipped with an output device, such as a display adapted to show the input digital image 10, and an input device such as a mouse, a trackpad, a touch panel 45 or other pointing device that can be used by a user to specify a set of coordinates of the digital image 10 by clicking and/or touching at corresponding positions of the display showing the digital image 10. The user device 40 is configured to allow a user to select a digital image 10 as well as a virtual object 20 for insertion, along with additional user inputs including at least the specification of an insertion position 145, and to receive a composited digital image that can be displayed, saved and/or shared on a social media platform.
[0084] The processing required to generate a composited digital image 30 from an input digital image 10, a virtual object 20 and at least an insertion position 145 can be implemented in various modules of a backend device 100. The backend device 100 can correspond to the user device 40 or to a different processing device that user device 40 can be in communication with, for instance through a network link. It can be appreciated that, where the user device 40 and the backend device 100 are different processing devices, not all modules need to be implemented on the same device. As an example, certain modules can be implemented on user device 40, other modules can be implemented on backend device 100, and yet other modules can be totally or partially implemented redundantly on both devices 40, 100, such that the functionality of these modules can be obtained on the most advantageous device, having regard, for instance, to the quality and availability of processing power or of the communication link.
[0085] Backend device 100 can include a calibration module 110, configured to acquire camera parameters 115 from input digital image 10. Typically, the calibration process enables the retrieval of information tied to the capture device for image formation. Standard perspective cameras are modelled by the pinhole model, where intrinsic information about the lens system defines how each light ray converges to a focal point to finally touch the image plane. As an example, Zhang, Zhengyou, “Flexible camera calibration by viewing a plane from unknown orientations”, Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999, the contents of which is hereby incorporated by reference, proposes to retrieve this information using a known planar pattern. With knowledge of the planar pattern beforehand and by matching it to the image, one can estimate the camera parameters such as the intrinsic parameters and the position of the camera with respect to the target. It is assumed that the camera is not “rolled”, such that the camera up vector is pointing toward the sky. This is a standard assumption with typical camera systems, as images are rarely captured with a roll angle. It is also assumed that the camera intrinsic parameters, such as the focal length, are known, and that the view centre coincides with the image centre. As other examples, Wang, Guanghui, et al., “Camera calibration and 3D reconstruction from a single view based on scene constraints”, Image and Vision Computing 23.3: 311-323 (2005), and Wilczkowiak, Marta, Edmond Boyer, and Peter Sturm, “Camera calibration and 3D reconstruction from single images using parallelepipeds”, Proceedings of the Eighth IEEE International Conference on Computer Vision, 2001, the contents of which are hereby incorporated by reference, establish the parameters using cuboid shapes typically used for architecture modelling from single images. However, these methods are more difficult to use for plane selection, as they require a user to select 4 points and use constraints such as collinearity and coplanarity to construct a 3D parallelepiped. As another example, an artificial neural network can be used to infer camera parameters 115 from a digital image 10, as described for instance in U.S. Patent No. 10,515,460, the contents of which is hereby incorporated by reference. Figure 12 shows a flow chart depicting a method for camera calibration 111 using a neural network, according to an example embodiment of calibration module 110. The neural network contains a convolutional backbone that takes a standard RGB image as input, followed by a fully connected neural network that outputs various geometric camera parameters 115 of the image formation process. The parameters include but are not limited to: a 3D vector representing the position of the camera with respect to a plane segment, a camera pan and tilt angle, and a set of points defining a polygon on the ground plane. The network is trained using a large dataset of panoramic images, i.e., images exhibiting a 360-degree field of view. During training, an image with a random field of view and camera orientation is extracted from the panorama and provided to the neural network.
To train the network, heavy data augmentation is used: the camera parameters are randomly sampled within realistic ranges and a random panorama is selected. From the generated camera parameters, a crop image representing those parameters is extracted from the panorama using reprojection and passed through the network for optimization, using the gradients computed from the camera parameters used. This process is repeated with a large number of parameters, over the full panorama dataset, multiple times during training. At inference time, the limited field of view of a standard-camera digital image 10 is provided to the network and the parameters 115 are directly estimated.
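By way of illustration only, the following Python sketch shows one possible way to generate such training samples by reprojecting random perspective crops out of equirectangular panoramas; the equirectangular convention, the parameter ranges and all function names are assumptions made for this example and do not form part of the described embodiments.

```python
# Minimal sketch of training-sample generation from panoramas (illustrative only).
import numpy as np

def sample_crop(panorama: np.ndarray, fov_deg: float, pan_deg: float,
                tilt_deg: float, out_hw=(256, 256)) -> np.ndarray:
    """Reproject a perspective crop with the given camera parameters out of an
    equirectangular panorama (H x W x 3, 360 deg horizontal, 180 deg vertical)."""
    pano_h, pano_w, _ = panorama.shape
    out_h, out_w = out_hw
    f = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2.0)   # focal length in pixels

    # Pixel grid of the virtual pinhole camera, centred on the principal point.
    x, y = np.meshgrid(np.arange(out_w) - out_w / 2.0,
                       np.arange(out_h) - out_h / 2.0)
    rays = np.stack([x, -y, np.full_like(x, f)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    # Rotate the rays by the sampled pan (yaw) and tilt (pitch) angles.
    pan, tilt = np.radians(pan_deg), np.radians(tilt_deg)
    rot_tilt = np.array([[1, 0, 0],
                         [0, np.cos(tilt), -np.sin(tilt)],
                         [0, np.sin(tilt),  np.cos(tilt)]])
    rot_pan = np.array([[np.cos(pan), 0, np.sin(pan)],
                        [0, 1, 0],
                        [-np.sin(pan), 0, np.cos(pan)]])
    rays = rays @ rot_tilt.T @ rot_pan.T

    # Convert ray directions to panorama (longitude/latitude) coordinates.
    lon = np.arctan2(rays[..., 0], rays[..., 2])          # [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))     # [-pi/2, pi/2]
    u = ((lon / (2 * np.pi) + 0.5) * (pano_w - 1)).astype(int)
    v = ((0.5 - lat / np.pi) * (pano_h - 1)).astype(int)
    return panorama[v, u]

def sample_training_example(panoramas: list) -> tuple:
    """Draw a random panorama and random camera parameters; return (crop, labels)."""
    pano = panoramas[np.random.randint(len(panoramas))]
    params = {"fov": np.random.uniform(40, 100),
              "pan": np.random.uniform(-180, 180),
              "tilt": np.random.uniform(-20, 20)}
    crop = sample_crop(pano, params["fov"], params["pan"], params["tilt"])
    return crop, params   # the network is optimized to regress `params` from `crop`
```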
[0086] Referring again to Figure 1, backend device 100 can include a detection and delineation module 120, configured to detect features in digital image 10 and segment the boundaries of these features, thereby obtaining a segmentation 125. The segmentation process can for instance be an instance segmentation process, where only certain features of interest are detected and delineated in segmentation 125. To facilitate realistic virtual object insertion in digital images, it can be desirable in particular to detect and delineate one or more ground planes, corresponding to one or more surfaces in the scene depicted by a digital image 10, such as the ground or the surface of a table, a shelf, etc., as well as objects that are positioned on the one or more ground planes. An object segmented in this fashion can be represented as a two-dimensional (2D) object that lies on a plane orthogonal to one or more ground planes in the tridimensional scene. With an accurate delineation of ground planes and orthogonal objects, inserting virtual objects on a ground plane, in front of or behind an object, optionally casting a shadow of the virtual object on the ground plane, can be greatly facilitated. An accurate delineation can advantageously be obtained interactively with a user inputting coordinates that are to be included in a plane or object. As an example, Figure 3A shows the segmentation of a person 330 in an image, and Figure 3B shows the segmentation of the ground 320 in the same image. It can be appreciated that the user only needs to enter one set of coordinates, for instance with one click 390, to define the contour of each semantic part of the scene. Different implementations of the detection and delineation module 120 are possible. Some are known to the art, for instance using a Hough transform as described in Okada, Kei, et al., “Plane segment finder: algorithm, implementation and applications”, Proceedings of the IEEE International Conference on Robotics and Automation, 2001, the contents of which is hereby incorporated by reference. For instance, a plane can be defined using image features and simple user inputs, as shown in Figures 4A to 4C. The user can intuitively select at least one point in the image to create a tridimensional (3D) plane that overlays the original image scene. This selects a ground plane, for instance by specifying an array of coordinates of the digital image 10 that correspond to the contour of the plane. A virtual object can then be inserted onto the ground plane. A ground plane geometry also allows a proper casting of the shadow to be rendered. The ground plane is not necessarily flat; it can be curved, rounded, jagged, textured or flat. In one embodiment, the user can be required to enter input values to solve for ambiguities. For instance, the user may be required to select a number of image coordinates. In one instance, the user can select 3 image coordinates that can represent 3 corners of the 3D plane segment, assuming that the 3 corners are coplanar and form right angles. This allows the algorithm to solve for the missing 6 parameters with trigonometry. By applying the appropriate constraints and using lines drawn in the image space, a large set of unknown parameters can be derived. In another embodiment, this method provides a simple interface for a user to create 3D plane segments using at least one swipe (effectively selecting 3 2D coordinates).
Such input can be useful for mobile-based user interaction: selecting more than 4 points is less user-friendly, and selecting vanishing points that lie outside of the image is a limitation for devices with limited screen size. Using this ground plane selection technique, the geometry can be used for virtual object composition by using the plane itself to orient the object and the plane geometry to cast realistic shadows. In this embodiment, a first click 491, a first swipe to 492 and a second swipe to 493 are needed to determine the ground plane 420 position. However, it can be appreciated that any number of swipes is allowable. Figure 13 shows a flow chart depicting an alternative implementation of a detection and delineation module 121 to obtain a segmented fragment 126 of an input digital image 10 interactively with user-provided coordinates, or “clicks,” via a neural network. In such an implementation, segmentation 126 will most often correspond to a segmentation mask, e.g., a matrix having a size equal to the resolution of digital image 10, where each element of the matrix corresponds to a pixel of the image and indicates whether it is part of a detected and delineated feature. Similar systems are known in the art and described for instance in U.S. Patent Application Publication No. 2008/0136820 A1, the contents of which is hereby incorporated by reference. The interactive segmentation neural network allows a user to select components in a scene with a single click. The neural network defines a segmentation mask based on the semantics of the image and the click position. From that segmentation, the segmentation contour pixels can be used to define how the tridimensional plane will be cut out. A detection and delineation module 121 can also take as input two sets of coordinates, one being defined as “positive” coordinates, containing coordinates that are specified as being a part of the feature to be delineated, the other being defined as “negative” coordinates, containing coordinates that are specified as not being part of the feature to be delineated. It can be appreciated that once a segmentation 126 has been inferred by the detection and delineation module 121, it can be displayed to the user and the user may be given the opportunity to change the sets of coordinates, for instance by adding additional “positive” and/or “negative” coordinates, such that the delineation algorithm can be run again, with more input data, possibly yielding a higher-quality segmentation 126. In some embodiments, a detection and delineation module 120, 121 can be configured to produce as output an alpha matte, as shown for instance in U.S. Patent No. 11,004,208, the contents of which is hereby incorporated by reference. The alpha matte is a type of segmentation mask where each element of the matrix indicates whether the corresponding pixel of the digital image 10 is part of the feature and additionally indicates the opacity of the feature in that pixel with respect to the background.
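As a hedged illustration of how the user clicks can be combined with the image as input to an interactive segmentation network such as the one of Figure 13, the Python sketch below encodes positive and negative clicks as Gaussian heat maps stacked with the RGB channels; the 5-channel layout, the Gaussian encoding and the names used are assumptions for this example only, not the patented design.

```python
# Encoding of user clicks for an interactive segmentation network (illustrative).
import numpy as np

def click_map(shape, clicks, sigma=10.0):
    """Render a set of (row, col) clicks as a single Gaussian heat map."""
    h, w = shape
    heat = np.zeros((h, w), dtype=np.float32)
    rows, cols = np.mgrid[0:h, 0:w]
    for r, c in clicks:
        heat = np.maximum(heat, np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma ** 2)))
    return heat

def build_segmentation_input(image, positive_clicks, negative_clicks=()):
    """Stack RGB + positive-click map + negative-click map into one 5-channel input."""
    img = image.astype(np.float32) / 255.0
    pos = click_map(img.shape[:2], positive_clicks)
    neg = click_map(img.shape[:2], negative_clicks)
    return np.dstack([img, pos, neg])            # H x W x 5 network input

def mask_from_probabilities(prob_map, threshold=0.5):
    """Binarize the network's per-pixel output into a segmentation mask."""
    return (prob_map >= threshold).astype(np.uint8)
```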
[0087] Referring once more to Figure 1, the backend device 100 can include a projection module 130. Once camera parameters 115, including for instance the position and the orientation of the camera that captured the digital image 10, and a segmentation 125 of one or more ground planes and of objects orthogonal to a ground plane are known, projection module 130 can construct a simple tridimensional representation of the scene comprising tridimensional shapes 135. Each of the ground planes delineated by the detection and delineation module 120 is back-projected as a corresponding tridimensional polygon using the camera parameters 115 and simple trigonometric functions. Each of the objects delineated by the detection and delineation module 120 is similarly back-projected as a corresponding tridimensional plane.
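The following Python sketch illustrates, under assumed conventions (pinhole camera with no roll, ground plane at y = 0, camera at a known height), how contour pixels of a ground plane segmentation could be back-projected to a tridimensional polygon; function names and the coordinate frame are assumptions for this example.

```python
# Back-projection of a 2D ground contour to a 3D polygon (illustrative sketch).
import numpy as np

def backproject_ground_contour(contour_px, image_size, fov_deg, camera_height, tilt_deg):
    """Return an N x 3 polygon on the ground plane for N contour pixels (u, v)."""
    w, h = image_size
    f = 0.5 * w / np.tan(np.radians(fov_deg) / 2.0)   # focal length in pixels
    tilt = np.radians(tilt_deg)
    # Camera-to-world rotation for a camera tilted up/down by `tilt` (no roll).
    rot = np.array([[1, 0, 0],
                    [0, np.cos(tilt), -np.sin(tilt)],
                    [0, np.sin(tilt),  np.cos(tilt)]])
    cam_pos = np.array([0.0, camera_height, 0.0])

    points_3d = []
    for u, v in contour_px:
        ray_cam = np.array([u - w / 2.0, -(v - h / 2.0), f])
        ray_world = rot @ (ray_cam / np.linalg.norm(ray_cam))
        if ray_world[1] >= 0:            # pixel above the horizon: no ground hit
            continue
        t = -cam_pos[1] / ray_world[1]   # intersection with the plane y = 0
        points_3d.append(cam_pos + t * ray_world)
    return np.array(points_3d)
```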
[0088] The backend device 100 can include an insertion module 140, configured to obtain, for instance from a user, an insertion position 145. Advantageously, the insertion module 140 can cause the 3D shapes 135 to be displayed on the user device 40, for instance by superimposing the edges of the 3D shapes 135 over the digital image 10, facilitating the task of using an input device to position the virtual object 20 orthogonal to the tridimensional polygon corresponding to the desired ground plane, at the desired depth e.g., in front of or behind the tridimensional plane corresponding to the desired object.
[0089] The backend device 100 can include a scaling module 150, configured to obtain, for instance from a user, a scaling 155, such that a virtual object of the appropriate size can be shown in the composited digital image 30. In some embodiments, it is possible that the size of the virtual object 20 is known but the scale of the digital image 10 is unknown. In such a case, with reference to Figures 8A and 8B, the scaling module 150 can cause a scaling grid to be displayed superimposed onto, e.g., a ground plane of the digital image 10 on user device 40, using the camera parameters 115, and allow the user to resize the squares of the scaling grid using an input device, e.g., a scroll wheel of a mouse, such that each square of the grid corresponds to an area of a predetermined size, e.g., 10 cm by 10 cm, from which a scaling is determined. Figures 8A and 8B provide simple visual feedback for estimation of the ground plane scale. In some embodiments, it is possible that the size of the virtual object 20 is unknown. In such a case, the scaling module 150 can cause the virtual object to be displayed superimposed onto the digital image 10 on user device 40, and allow the user to resize the virtual object 20 directly using an input device, e.g., a scroll wheel of a mouse.

[0090] Referring once more to Figure 1, the backend device 100 can include a light parameters module 160, configured to obtain lighting parameters 165, including for instance the position, the direction, the angular size and the colour of one or more light sources in digital image 10. In some embodiments, the light parameters module 160 is configured to infer the lighting parameters 165, for instance using an artificial neural network trained for this task, as described, e.g., in International Patent Application Publication No. WO 2021/042208 or in U.S. Provisional Application No. 63/364,588, the contents of which are hereby incorporated by reference. In some embodiments, the light parameters module 160 is configured to allow a user to edit inferred lighting parameters 165 and/or to allow a user to specify arbitrary lighting parameters 165 through the user device 40.
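Purely as an illustration of how lighting parameters of this kind might be held and converted into a usable light direction, the sketch below defines a small container; the field names, units and azimuth/elevation convention are assumptions made for this example, not a prescribed representation.

```python
# Illustrative container for lighting parameters (direction, angular size, colour).
from dataclasses import dataclass
import numpy as np

@dataclass
class LightParameters:
    azimuth_deg: float          # horizontal angle toward the light
    elevation_deg: float        # angle of the light above the horizon
    angular_size_deg: float     # apparent size of the light source
    color_rgb: tuple            # e.g. (1.0, 0.95, 0.9)

    def direction(self) -> np.ndarray:
        """Unit vector pointing from the light toward the scene (y is up)."""
        az = np.radians(self.azimuth_deg)
        el = np.radians(self.elevation_deg)
        to_light = np.array([np.cos(el) * np.sin(az), np.sin(el), np.cos(el) * np.cos(az)])
        return -to_light    # the light shines opposite to the direction toward it

# Example: a warm light 45 degrees above the horizon, usable for shadow casting below.
sun = LightParameters(azimuth_deg=120.0, elevation_deg=45.0,
                      angular_size_deg=0.5, color_rgb=(1.0, 0.95, 0.9))
light_dir = sun.direction()   # negative y component, so shadows land on the ground
```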
[0091] The backend device 100 can include a rendering module 170, configured to generate a render 175 of the virtual object 20 ready for compositing in the input digital image 10. The rendering module 170 takes as input the virtual object 20, the segmentation 125 and the insertion position 145 and creates a render 175 ready to be composited into digital image 10 by the compositing module 180, creating the composited digital image 30. When the insertion position 145 has the effect of placing the virtual object 20 behind a detected object in digital image 10, rendering module 170 uses the segmentation 125 of the detected object to cause an occlusion of virtual object 20, e.g., by cropping a portion of the virtual object 20 that is not to be visible because it is placed behind the detected object. When the segmentation 125 is an alpha matte, occlusion will cause the virtual object 20 to be partially visible in pixels of the detected object that are not entirely opaque. The rendering module 170 can additionally take as input a scaling 155. When a scaling 155 is available, the virtual object 20 can be scaled according to the scaling 155 before being rendered.
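A minimal sketch of this occlusion-aware compositing is given below, assuming H x W x 3 images in [0, 1] and H x W alpha mattes; the function name and array layout are illustrative assumptions only.

```python
# Alpha compositing of a rendered virtual object with occlusion by a detected object.
import numpy as np

def composite_with_occlusion(background, object_rgb, object_alpha, occluder_alpha=None):
    """Alpha-composite `object_rgb` over `background`; where `occluder_alpha` is
    opaque, the virtual object is hidden (it sits behind the detected object)."""
    alpha = object_alpha.copy()
    if occluder_alpha is not None:
        alpha = alpha * (1.0 - occluder_alpha)   # attenuate by the occluder's opacity
    alpha = alpha[..., None]                      # broadcast over the RGB channels
    return alpha * object_rgb + (1.0 - alpha) * background
```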
[0092] The rendering module 170 can additionally take as input lighting parameters 165. When lighting parameters 165 are available, one or more shadows of the virtual object 20 can be rendered so as to appear to be cast on the ground plane orthogonally to which the virtual object 20 is inserted. When the position of one of the shadows has the effect of placing the shadow behind a detected object in digital image 10, rendering module 170 uses the segmentation 125 of the detected object to cause an occlusion of the shadow, e.g., by cropping a portion of the shadow that is not to be visible because it is placed behind the detected object. When the position of one of the shadows has the effect of having the shadow fall over the edge of the ground plane orthogonally to which the virtual object 20 is being inserted in digital image 10, rendering module 170 uses the segmentation 125 of the ground plane to cause an occlusion of the shadow, e.g., by cropping a portion of the shadow that is not to be visible because it would be cast outside of the ground plane.
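The simple trigonometric shadow cast described above can be sketched as follows: each vertex of the virtual object is projected along the light direction onto the ground plane (assumed here to be y = 0), and the resulting footprint is then masked by the ground and occluder segmentations; the conventions and names are assumptions for this illustration.

```python
# Shadow projection onto the ground plane and masking by ground/occluder (sketch).
import numpy as np

def project_shadow(vertices, light_dir):
    """Project 3D vertices onto the ground plane y = 0 along `light_dir`.

    `light_dir` points from the light toward the scene and must have a negative
    y component for a shadow to land on the ground."""
    vertices = np.asarray(vertices, dtype=float)
    light_dir = np.asarray(light_dir, dtype=float)
    light_dir = light_dir / np.linalg.norm(light_dir)
    if light_dir[1] >= 0:
        raise ValueError("light must shine downward to cast a ground shadow")
    t = -vertices[:, 1] / light_dir[1]            # distance along the ray to y = 0
    return vertices + t[:, None] * light_dir      # shadow footprint on the ground

def occlude_shadow(shadow_mask, ground_mask, occluder_mask=None):
    """Keep only the shadow pixels that fall on the ground plane and are not
    hidden behind a detected object."""
    visible = shadow_mask * ground_mask
    if occluder_mask is not None:
        visible = visible * (1 - occluder_mask)
    return visible
```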
[0093] As an example, with reference to Figure 7A, a virtual object corresponding to a cube 710 is inserted orthogonally to a ground plane corresponding to the surface of a table 720, away from a detected object corresponding to a house plant 730. A shadow 715 is cast onto the surface of the table 720. Because neither the cube 710 nor the shadow 715 is to be rendered at a position partially behind the plant 730 or outside the table 720, no occlusion is applied. As another example, with reference to Figure 7B, a virtual object corresponding to a cube 710 is inserted orthogonally to and near the edge of a ground plane corresponding to the surface of a table 720, away from a detected object corresponding to a house plant 730. A shadow 715 is cast onto the surface of the table 720. Because the shadow 715 falls over the edge and would therefore be rendered at a position partially outside the table 720, occlusion of the part of the shadow that would be rendered outside of the table 720 is applied. As another example, with reference to Figure 7C, a virtual object corresponding to a cube 710 is inserted orthogonally to a ground plane corresponding to the surface of a table 720, partially behind a detected object corresponding to a house plant 730. A shadow 715 is cast onto the surface of the table 720. Because both the cube 710 and the shadow 715 are partially behind the plant 730, occlusion of the part of the cube and of the shadow that would be behind the plant 730 is applied.

[0094] With reference once more to Figure 1, to acquire a virtual object 20 that corresponds to a portion of a source image, it is possible to apply the detection and delineation module 120 to the source image in order to obtain a segmentation 125 of the desired object of the source image. The object is then cropped off the source image along the contours specified by the segmentation 125 and becomes a virtual object 20, ready for compositing in input digital image 10. As an example, Figure 9A shows a digital image comprising a ground plane 920 and a detected object 930, Figure 9B shows the same digital image with a virtual object 910 inserted in front of the detected object 930, and Figure 9C shows the same digital image with the virtual object 910 inserted behind the detected object 930. With reference to Figure 9D, because detected object 930 is itself detected and delineated by the detection and delineation module 120, it can be cropped off the digital image and used as an additional virtual object 980 to be inserted in the digital image. Cropping detected object 930 off the digital image will leave a “hole” 985 at the original position of object 930. The backend device 100 can therefore include an inpainting module, configured to fill “hole” 985 with an inferred texture after object 930 is cropped off. Techniques to implement an inpainting module, e.g., Suvorov, Roman, et al., “Resolution-robust large mask inpainting with Fourier convolutions”, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, the contents of which is incorporated herein by reference, are known to the art.
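As a hedged illustration of turning a detected object into a virtual object and marking the resulting hole for inpainting, the sketch below uses the object's segmentation mask as an alpha channel; the helper names are assumptions, and the inpainting call is deliberately left abstract rather than tied to any particular model.

```python
# Cropping a detected object out of the image and preparing the hole for inpainting.
import numpy as np

def crop_virtual_object(image, mask):
    """Return (object_rgba, hole_mask): the cropped object with the mask as its
    alpha channel, and the region of the source image that now needs inpainting."""
    rgba = np.dstack([image, (mask * 255).astype(image.dtype)])
    hole_mask = mask.astype(bool)
    return rgba, hole_mask

def remove_object(image, hole_mask, inpaint_fn):
    """Blank out the cropped pixels and delegate filling them to `inpaint_fn`,
    e.g. a large-mask inpainting network (assumed interface)."""
    holed = image.copy()
    holed[hole_mask] = 0
    return inpaint_fn(holed, hole_mask)
```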
[0095] With reference to Figure 2, an exemplary method 200 for compositing a virtual object in an input digital image acquired with a camera is shown. Broadly described, method 200 provides a workflow adapted to generate a composited digital image from an input digital image and a virtual object, including steps for calibration 210, detection and delineation 220, projection 230, insertion and rendering 240, and compositing 250. It can be appreciated that the steps of method 200 can be performed sequentially or in parallel in any order, provided that all the inputs required by a given step are produced before that step is started. As an example, the steps of calibration 210 and detection and delineation 220 can be performed in any order or in parallel, but both must be finished before the step of projection 230 is started, because projection 230 depends on camera parameters obtained during the step of calibration 210 and on segments obtained during the step of detection and delineation 220.
[0096] A first step of method 200 can be the step of camera calibration 210, which can for instance be performed by a calibration module implementing a convolutional neural network trained to infer camera parameters such as the 3D camera orientation (elevation, tilt, camera height), the camera field of view and the camera aperture from the input digital image.
[0097] A second step of method 200 can be the step of detection and delineation 220, which can for instance be performed by a detection and delineation module implementing a convolutional neural network trained to infer, from the input digital image, a segmentation mask or segmentation contours corresponding to the two-dimensional shape of one or more ground planes and of one or more detected objects orthogonal to the one or more ground planes. In some embodiments, the segmentation mask or segmentation contours can be inferred by the neural network with the help of coordinates provided by a user via a user device.
[0098] Once detection and delineation 220 of detected objects is performed, a subsequent step can include cropping 222 one of the objects from the digital image for use as a virtual object to be reinserted at a different position in the scene represented by the digital image. Performing the cropping 222 step will result in a “hole” appearing in the digital image at the position from which the detected object was cropped. A subsequent step can therefore include inpainting 224, which can for instance be performed by an inpainting module implementing a fast Fourier convolutions neural network trained to fill a region of the digital image with an appropriate texture.
[0099] A next step of method 200 can be the step of projection 230, which can be performed by simple trigonometric calculations adapted to back-projecting two-dimensional ground planes as tridimensional polygons and two-dimensional orthogonal objects as tridimensional orthogonal planes using their segmentation and the camera parameters.
[00100] A next step of method 200 can be the step of insertion and rendering 240, during which a user specifies the insertion position of the virtual object, for instance through an insertion module, by manipulating the virtual object in the simplified tridimensional scene representation generated from the tridimensional polygons and planes, and the virtual object is rendered at the specified insertion position, for instance by a rendering module.
[00101] During the insertion and rendering 240 of the virtual object, a step of scaling 242 the virtual object can be performed, for instance by a scaling module. As an example, if the dimensions of the virtual object are known, a user can provide a scale for the image, so that during rendering the virtual object can be scaled appropriately. As another example, even if the dimensions of the virtual object are not known, a user can directly manipulate and scale the virtual object before rendering.
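As a toy illustration of this scaling step, the sketch below derives a metres-per-scene-unit factor from a user-adjusted grid square (assumed to represent 10 cm) and sizes the virtual object accordingly; all names and values are illustrative assumptions only.

```python
# Deriving a scale from the user-adjusted grid and scaling the virtual object (sketch).
def scene_scale_from_grid(grid_square_scene_units: float, square_real_size_m: float = 0.10) -> float:
    """Metres represented by one unit of the back-projected scene."""
    return square_real_size_m / grid_square_scene_units

def object_size_in_scene_units(object_size_m: float, metres_per_unit: float) -> float:
    """Size to give the virtual object so that it renders at its real-world size."""
    return object_size_m / metres_per_unit

# Example: the user stretched each 10 cm square to 0.25 scene units,
# so a 1.8 m tall virtual object should span 4.5 scene units.
scale = scene_scale_from_grid(0.25)                       # 0.4 m per scene unit
height_units = object_size_in_scene_units(1.8, scale)     # 4.5
```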
[00102] During the insertion and rendering 240 of the virtual object, if lighting parameters are known, a step of shadow casting 244 can be performed, for instance by the rendering module, using simple trigonometric calculations on the position and dimension of the virtual objects and the lighting parameters, e.g., the position and direction of a light source.
[00103] During the insertion and rendering 240 of the virtual object, if a virtual object is to be rendered behind a detected object of the digital image, an occlusion step 246 can be performed, for instance by the rendering module, e.g., by cropping off the portions of the virtual objects that would be behind the detected object according to the virtual object insertion position and the segmentation of the detected object. Additionally, if a shadow is to be rendered behind a detected object of the digital image and/or off the edge of the ground plane, the occlusion step 246 can be performed, for instance by the rendering module, e.g., by cropping off the portions of the shadow that would be behind the detected object and/or outside the ground plane according to the virtual object insertion position and the segmentation of the detected object and/or the ground plane.
[00104] A final step of method 200 can be the step of compositing 250, during which a new, composited digital image is created by inserting the rendered virtual object at the specified position in the input digital image.
[00105] Figures 5A to 5D are diagrams illustrating an example application of steps 210 to 240. Figure 5A shows a tridimensional model of a virtual object 510, either estimated from an image from which the object was, e.g., cropped, or captured from specialized hardware, to be inserted in a standard two-dimensional RGB image. Figure 5B shows a standard two-dimensional RGB image, where one ground plane 520 and two detected objects 530a, 530b are segmented during the detection and delineation 220 step. From the image, the camera calibration, such as the field of view, and the camera 3D position are estimated at the calibration 210 step. From the extracted information, a 3D scene is composed from the ground segment and the object segments. Figure 5C shows the virtual object 510 being inserted behind a segmented object 530a and light interaction (shadow 515) being inserted on the ground. Finally, Figure 5D represents a different view of the 3D scene, with the camera position 550 inferred during calibration 210 being used to obtain a tridimensional projection of the ground plane 521 and of the detected objects 531a, 531b, and a virtual light 560 either specified or inferred.
[00106] Figures 6A to 6D show diagrams of another example application of steps 210 to 240. Figure 6A shows a ground plane segment 620 and detected objects 630a, 630b in a two-dimensional image over a background 640. Figure 6B shows the camera parameters, including the position of the camera 650, having been extracted during calibration 210, the back-projected 3D polygon 621 corresponding to the ground plane, and the back-projected 3D planes corresponding to the detected objects 631a, 631b. Figure 6C shows a virtual object 610 being placed into the 3D scene, behind the back-projected object 631a. Figure 6D shows the result of the compositing 250 step, the virtual object 610 having been inserted behind object 630a with the appropriate occlusion and its shadow being cast on ground plane 620, also with the appropriate occlusion with respect to object 630a, over the background 640.
[00107] Figure 10 presents a flow chart depicting an embodiment of a system 2 including a contour selection algorithm via a neural network implemented in a backend device 100. Figure 10 provides a functional diagram for the geometry creation process. From a standard RGB image as input 10, camera parameters 115 are extracted, as well as an outline/segmentation of the image through a parametric model. The parametric model can be a circle or a square, or a different segmentation method can be used, as explained above or shown in Figure 13. This module gives a set of image coordinates that describe the plane contour in the image. Finally, using the camera projection model (with the previously estimated parameters), this outline can be back-projected in 3D to a ground plane, providing a 3D polygon 136 with respect to the camera that represents the plane segment that the user selects.
[00108] Figure 11 presents a flow chart depicting an embodiment of a system 3 including a contour selection algorithm via a neural network implemented in a backend device 100. Figure 11 also provides a functional diagram for the geometry creation process. This embodiment may further comprise a lighting extraction module. Lighting parameters can be inferred from the input image 10. This enables the use of a light simulation system to cast shadows/reflections 176 on the 3D polygon for the final rendering.
[00109] For simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth in order to provide a thorough understanding of the examples described herein. However, it will be understood by those of ordinary skill in the art that the examples described herein may be practised without these specific details. In other instances, well- known methods, procedures, and components have not been described in detail so as not to obscure the examples described herein. Also, the description is not to be considered as limiting the scope of the examples described herein.
[00110] It will be appreciated that the examples and corresponding diagrams used herein are for illustrative purposes only. Different configurations and terminology can be used without departing from the principles expressed herein. For instance, components and modules can be added, deleted, modified, or arranged with differing connections without departing from these principles.
[00111] Although the above principles have been described with reference to certain specific examples, it will be understood by persons skilled in the art that other variants and modifications may be made without departing from the scope of the invention as defined in the claims appended hereto.

Claims

1. A method for compositing a virtual object in an input digital image acquired with a camera, the method comprising:
- extracting a set of camera calibration parameters corresponding to the camera;
- detecting and delineating at least one ground plane;
- detecting and delineating at least one object orthogonal to one of the at least one ground plane;
- inferring a tridimensional scene by back-projecting each of the at least one ground plane to at least one corresponding tridimensional polygon and each of the at least one object to at least one corresponding tridimensional plane;
- inserting and rendering the virtual object at a specified location in the tridimensional scene orthogonally to one of the at least one ground plane; and
- compositing the rendered virtual object in the input digital image.
2. The method of claim 1, wherein the camera calibration parameters are estimated using a camera calibration neural network trained to map a calibration input comprising the input digital image to the set of camera calibration parameters.
3. The method of claim 2, wherein training the camera calibration neural network comprises:
- receiving a set of panoramic images;
- creating a plurality of sample images, wherein each of the plurality of sample images is a reprojection of a portion of one random panoramic image from the set of panoramic images using random camera parameters; and
- optimizing the camera calibration neural network with each of the plurality of sample images and the corresponding random camera parameters.
4. The method of any one of claims 1 to 3, wherein the at least one ground plane is detected and delineated using a segmentation neural network trained to map a segmentation input comprising the input digital image to an output comprising the delineation of the at least one ground plane.
5. The method of claim 4, wherein the segmentation input further comprises coordinates of at least one activation of a pointing device.
6. The method of claim 5, wherein the coordinates comprise positive coordinates and negative coordinates.
7. The method of any one of claims 4 to 6, wherein the detected at least one object is detected and delineated using the segmentation neural network, wherein the output further comprises the delineation of the detected at least one object.
8. The method of any one of claims 1 to 6, wherein the delineation of the at least one ground plane and the delineation of the detected at least one object are two-dimensional delineations.
9. The method of any one of claims 1 to 7, wherein at least one of:
- the delineation of the at least one ground plane; and
- the delineation of the detected at least one object is an array of coordinates corresponding to a contour.
10. The method of any one of claims 1 to 8, wherein at least one of:
- the delineation of the at least one ground plane; and
- the delineation of the detected at least one object is a segmentation mask.
11. The method of claim 9, wherein the segmentation mask is an alpha matte.
12. The method of any one of claims 1 to 10, wherein rendering the virtual object in the tridimensional scene comprises scaling the virtual object.
13. The method of any one of claims 1 to 11, wherein, in response to the virtual object being inserted at least partially behind at least one of the detected at least one object in the tridimensional scene, rendering the virtual object comprises occluding the virtual object with respect to the delineation of the detected at least one object.
14. The method of any one of claims 1 to 12, further comprising the step of estimating lighting parameters, and wherein rendering the virtual object comprises casting a shadow on the corresponding one of the at least one ground plane with respect to the estimated lighting parameters.
15. The method of any one of claims 1 to 12, further comprising the step of defining arbitrary lighting parameters, and wherein rendering the virtual object comprises casting a shadow on the corresponding one of the at least one ground plane with respect to the arbitrary lighting parameters.
16. The method of claim 13 or 14, wherein, in response to the shadow being cast at least partially behind at least one of the detected at least one object in the tridimensional scene, casting the shadow comprises occluding the shadow with respect to the delineation of the detected at least one object.
17. The method of any one of claims 1 to 15, wherein the virtual object is cropped from a second digital image.
18. The method of any one of claims 1 to 15, wherein the virtual object is cropped from the input digital image.
19. The method of claim 17, further comprising inpainting an area of the input digital image corresponding to cropped pixels of the virtual object.
20. A system for compositing a virtual object in an input digital image acquired with a camera, the system comprising:
- a user input device;
- a calibration parameter extraction module configured to extract a set of camera calibration parameters corresponding to the camera;
- a detection and delineation module configured to:
- detect and delineate at least one ground plane, and
- detect and delineate at least one object orthogonal to the ground plane;
- a back-projection module configured to infer a tridimensional scene by back-projecting each of the at least one ground plane to at least one corresponding tridimensional polygon and each of the at least one object to at least one corresponding tridimensional plane;
- an insertion module configured to allow for the insertion of the virtual object at a specified location in the tridimensional scene by the user input device;
- a rendering module configured to render the virtual object at the specified location in the tridimensional scene orthogonally to a corresponding one of the at least one ground plane; and
- a compositing module configured to composite the rendered virtual object in the input digital image.
21. The system of claim 20, wherein the user input device is a pointing device.
22. The system of claim 20 or 21, wherein the calibration parameter extraction module comprises a camera calibration neural network trained to map a calibration input comprising the input digital image to the set of camera calibration parameters.
23. The system of claim 22, further comprising a camera calibration neural network training module configured to:
- create a plurality of sample images, wherein each of the plurality of sample images is a reprojection of a portion of one random panoramic image from a set of panoramic images using random camera parameters; and
- optimize the camera calibration neural network with each of the plurality of sample images and the corresponding random camera parameters.
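As a non-limiting sketch of the training module of claim 23 and the corresponding training steps of claim 41, the following Python function reprojects a random portion of an equirectangular panorama with randomly drawn camera parameters, returning both the sample image and the parameters used to generate it. The equirectangular format, the parameter ranges and the nearest-neighbour sampling are illustrative assumptions.

```python
# Minimal sketch, assuming equirectangular panoramas and a pinhole camera model.
import numpy as np

def random_crop_from_panorama(pano: np.ndarray, out_h=256, out_w=256, rng=None):
    rng = rng or np.random.default_rng()
    # Random camera parameters (the "random camera parameters" of the claim).
    yaw   = rng.uniform(-np.pi, np.pi)
    pitch = rng.uniform(-0.3, 0.3)           # radians
    roll  = rng.uniform(-0.1, 0.1)
    vfov  = rng.uniform(np.deg2rad(30), np.deg2rad(90))
    f = 0.5 * out_h / np.tan(0.5 * vfov)     # focal length in pixels

    # Pixel grid -> unit viewing rays (camera frame: x right, y down, z forward).
    u, v = np.meshgrid(np.arange(out_w), np.arange(out_h))
    rays = np.stack([(u - out_w / 2) / f, (v - out_h / 2) / f,
                     np.ones_like(u, float)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    # Rotate the rays by roll, pitch and yaw.
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    d = rays @ (Ry @ Rx @ Rz).T

    # Ray direction -> equirectangular coordinates, nearest-neighbour lookup.
    lon = np.arctan2(d[..., 0], d[..., 2])            # [-pi, pi]
    lat = np.arcsin(np.clip(-d[..., 1], -1, 1))       # [-pi/2, pi/2]
    H, W = pano.shape[:2]
    px = ((lon / (2 * np.pi) + 0.5) * (W - 1)).astype(int)
    py = ((0.5 - lat / np.pi) * (H - 1)).astype(int)
    return pano[py, px], {"pitch": pitch, "roll": roll, "vfov": vfov}

if __name__ == "__main__":
    pano = np.random.rand(512, 1024, 3)               # stand-in for a real panorama
    crop, cam = random_crop_from_panorama(pano)
    print(crop.shape, cam)                            # (256, 256, 3) and the parameters
```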
24. The system of any one of claims 20 to 23, wherein the detection and delineation module comprises a segmentation neural network trained to map a segmentation input comprising the input digital image to an output comprising the delineation of the at least one ground plane and of the detected at least one object.
25. The system of claim 24, wherein the segmentation input further comprises coordinates obtained from the user input device.
26. The system of claim 25, wherein the coordinates comprise positive coordinates and negative coordinates.
27. The system of any one of claims 20 to 26, wherein the delineation of the at least one ground plane and the delineation of the detected at least one object are two-dimensional delineations.
28. The system of any one of claims 20 to 27, wherein at least one of:
- the delineation of the at least one ground plane; and
- the delineation of the detected at least one object is an array of coordinates corresponding to a contour.
29. The system of any one of claims 20 to 28, wherein at least one of:
- the delineation of the at least one ground plane; and
- the delineation of the detected at least one object is a segmentation mask.
30. The system of claim 29, wherein the segmentation mask is an alpha matte.
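Claims 27 to 30 allow the delineations to be two-dimensional, to be contours, or to be segmentation masks, in particular alpha mattes. The Python sketch below shows one conventional way a fractional alpha matte can be used during compositing; the linear blending is an illustrative assumption, not a requirement of the claims.

```python
# Minimal sketch of using a segmentation mask that is an alpha matte:
# fractional alpha values blend the delineated region smoothly into the image.
import numpy as np

def composite_with_alpha_matte(background: np.ndarray,
                               foreground: np.ndarray,
                               alpha: np.ndarray) -> np.ndarray:
    """background, foreground: (H, W, 3) floats in [0, 1];
    alpha: (H, W) matte in [0, 1], where 1 means fully foreground."""
    a = alpha[..., None]                       # broadcast over colour channels
    return a * foreground + (1.0 - a) * background

if __name__ == "__main__":
    bg = np.zeros((4, 4, 3))
    fg = np.ones((4, 4, 3))
    matte = np.full((4, 4), 0.25)              # soft, fractional coverage
    print(composite_with_alpha_matte(bg, fg, matte)[0, 0])   # [0.25 0.25 0.25]
```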
31. The system of any one of claims 20 to 30, further comprising a scaling module configured to define a scale of the input image, wherein the rendering module is further configured to scale the virtual object with respect to the scale of the input image.
32. The system of any one of claims 20 to 30, wherein the rendering module is further configured to scale the virtual object with respect to an arbitrary scale.
33. The system of any one of claims 20 to 32, wherein the insertion module is further configured to detect that the virtual object is being inserted at least partially behind at least one of the detected at least one object in the tridimensional scene, and wherein the rendering module is further configured, in response to the virtual object being inserted at least partially behind at least one of the detected at least one object in the tridimensional scene, to occlude the virtual object with respect to the delineation of the detected at least one object.
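By way of a non-limiting illustration of the occlusion behaviour of claim 33 (and of claims 13 and 51), the sketch below attenuates the alpha of the rendered virtual object wherever the delineation mask of a detected foreground object is opaque, before compositing into the input image. Using a simple mask multiplication is an assumption made for brevity.

```python
# Minimal sketch of occluding a rendered virtual object with the 2-D
# delineation (mask) of a detected object that stands in front of it.
import numpy as np

def occlude_and_composite(image: np.ndarray,
                          rendered_rgb: np.ndarray,
                          rendered_alpha: np.ndarray,
                          occluder_mask: np.ndarray) -> np.ndarray:
    """image, rendered_rgb: (H, W, 3); rendered_alpha, occluder_mask: (H, W) in [0, 1].
    Wherever the occluder mask is opaque, the virtual object is hidden."""
    visible_alpha = rendered_alpha * (1.0 - occluder_mask)
    a = visible_alpha[..., None]
    return a * rendered_rgb + (1.0 - a) * image

if __name__ == "__main__":
    img = np.zeros((2, 2, 3))
    obj = np.ones((2, 2, 3))
    alpha = np.ones((2, 2))
    occluder = np.array([[1.0, 0.0], [0.0, 0.0]])   # top-left pixel is occluded
    print(occlude_and_composite(img, obj, alpha, occluder)[:, :, 0])
```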
34. The system of any one of claims 20 to 33, further comprising a lighting parameter estimation module configured to estimate lighting parameters, wherein the rendering module is further configured to cast a shadow of the virtual object on the ground plane with respect to the estimated lighting parameters.
35. The system of any one of claims 20 to 33, wherein the rendering module is further configured to cast a shadow of the virtual object on the ground plane with respect to arbitrary lighting parameters.
36. The system of claim 34 or 35, wherein the rendering module is further configured to detect that the shadow is being cast at least partially behind at least one of the detected at least one object in the tridimensional scene and, in response to the shadow being cast at least partially behind at least one of the detected at least one object in the tridimensional scene, to occlude the shadow with respect to the delineation of the detected at least one object.
37. The system of any one of claims 20 to 36, further comprising a cropping module configured to acquire the virtual object by cropping a source digital image.
38. The system of claim 37, further comprising an inpainting module configured, in response to the source digital image being the input digital image, to inpaint an area of the input digital image corresponding to cropped pixels of the virtual object.
39. A non-transitory computer readable medium having recorded thereon statements and instructions for compositing a virtual object in an input digital image acquired with a camera, said statements and instructions when executed by at least one processor causing the at least one processor to:
- extract a set of camera calibration parameters corresponding to the camera;
- detect and delineate at least one ground plane;
- detect and delineate at least one object orthogonal to one of the at least one ground plane;
- infer a tridimensional scene by back-projecting each of the at least one ground plane to at least one corresponding tridimensional polygon and each of the at least one object to at least one corresponding tridimensional plane;
- insert and render the virtual object at a specified location in the tridimensional scene orthogonally to one of the at least one ground plane; and
- composite the rendered virtual object in the input digital image.
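A minimal sketch of the back-projection underlying the step of inferring a tridimensional scene (claims 20 and 39) is given below: a pixel is lifted to a viewing ray using pinhole intrinsics and intersected with a horizontal ground plane. The particular parameterization, a focal length in pixels together with a camera pitch and a camera height, is an assumption made for illustration; the claims do not prescribe a specific camera model.

```python
# Minimal sketch, assuming a pinhole camera at height `cam_height` above a
# horizontal ground plane, tilted down by `pitch` radians about its x axis.
# World frame: X right, Y up, Z forward; ground plane: Y = 0; camera at (0, h, 0).
import numpy as np

def backproject_to_ground(u, v, f, cx, cy, pitch, cam_height):
    """Return the 3-D ground-plane point seen at pixel (u, v)."""
    # Pixel -> viewing ray at zero pitch (image v grows downward, world Y grows upward).
    d = np.array([(u - cx) / f, -(v - cy) / f, 1.0])
    c, s = np.cos(pitch), np.sin(pitch)
    R_pitch = np.array([[1.0, 0.0, 0.0],
                        [0.0,   c,  -s],
                        [0.0,   s,   c]])
    d = R_pitch @ d                            # tilt the ray down by `pitch`
    if d[1] >= 0:
        raise ValueError("pixel ray does not intersect the ground plane")
    t = -cam_height / d[1]                     # ray parameter where Y = 0
    return np.array([0.0, cam_height, 0.0]) + t * d

if __name__ == "__main__":
    # A pixel below the principal point, camera 1.6 m above the ground, no tilt.
    print(backproject_to_ground(u=320, v=400, f=600.0, cx=320.0, cy=240.0,
                                pitch=0.0, cam_height=1.6))   # approx. [0. 0. 6.]
```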
40. The non-transitory computer readable medium of claim 39, wherein the camera calibration parameters are estimated using a camera calibration neural network trained to map a calibration input comprising the input digital image to the set of camera calibration parameters.
41. The non-transitory computer readable medium of claim 40, wherein training the camera calibration neural network comprises:
- receiving a set of panoramic images;
- creating a plurality of sample images, wherein each of the plurality of sample images is a reprojection of a portion of one random panoramic image from the set of panoramic images using random camera parameters; and
- optimizing the camera calibration neural network with each of the plurality of sample images and the corresponding random camera parameters.
42. The non-transitory computer readable medium of any one of claims 39 to 41, wherein the at least one ground plane is detected and delineated using a segmentation neural network trained to map a segmentation input comprising the input digital image to an output comprising the delineation of the at least one ground plane.
43. The non-transitory computer readable medium of claim 42, wherein the segmentation input further comprises coordinates of at least one activation of a pointing device.
44. The non-transitory computer readable medium of claim 43, wherein the coordinates comprise positive coordinates and negative coordinates.
45. The non-transitory computer readable medium of any one of claims 43 or 44, wherein the detected at least one object is detected and delineated using the segmentation neural network, wherein the output further comprises the delineation of the detected at least one object.
46. The non-transitory computer readable medium of any one of claims 39 to 45, wherein the delineation of the at least one ground plane and the delineation of the detected at least one object are two-dimensional delineations.
47. The non-transitory computer readable medium of any one of claims 39 to 45, wherein at least one of:
- the delineation of the at least one ground plane; and
- the delineation of the detected at least one object is an array of coordinates corresponding to a contour.
48. The non-transitory computer readable medium of any one of claims 39 to 47, wherein at least one of:
- the delineation of the at least one ground plane; and
- the delineation of the detected at least one object is a segmentation mask.
49. The non-transitory computer readable medium of claim 48, wherein the segmentation mask is an alpha matte.
50. The non-transitory computer readable medium of any one of claims 39 to 49, wherein rendering the virtual object in the tridimensional scene comprises scaling the virtual object.
51. The non-transitory computer readable medium of any one of claims 39 to 50, wherein, in response to the virtual object being inserted at least partially behind at least one of the detected at least one object in the tridimensional scene, rendering the virtual object comprises occluding the virtual object with respect to the delineation of the detected at least one object.
52. The non-transitory computer readable medium of any one of claims 39 to 51, wherein the statements and instructions further cause the at least one processor to estimate lighting parameters, and wherein rendering the virtual object comprises casting a shadow on the corresponding one of the at least one ground plane with respect to the estimated lighting parameters.
53. The non-transitory computer readable medium of any one of claims 39 to 51, wherein the statements and instructions further cause the at least one processor to define arbitrary lighting parameters, and wherein rendering the virtual object comprises casting a shadow on the corresponding one of the at least one ground plane with respect to the arbitrary lighting parameters.
54. The non-transitory computer readable medium of claim 52 or 53, wherein, in response to the shadow being cast at least partially behind at least one of the detected at least one object in the tridimensional scene, casting the shadow comprises occluding the shadow with respect to the delineation of the detected at least one object.
55. The non-transitory computer readable medium of any one of claims 39 to 54, wherein the virtual object is cropped from a second digital image.
56. The non-transitory computer readable medium of any one of claims 39 to 54, wherein the virtual object is cropped from the input digital image.
57. The non-transitory computer readable medium of claim 56, wherein the statements and instructions further cause the at least one processor to inpaint an area of the input digital image corresponding to cropped pixels of the virtual object.
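By way of a non-limiting illustration of the shadow casting recited in claims 14 to 16, 34 to 36 and 52 to 54, the sketch below projects the vertices of a virtual object onto the ground plane along a single directional light. Treating the lighting parameters as one directional light and producing a hard shadow footprint are simplifying assumptions; estimated or arbitrary lighting parameters of any richer form could be used instead.

```python
# Minimal sketch of casting a hard shadow of a virtual object onto the ground
# plane Y = 0 with a single directional light, one possible simplified choice
# of "lighting parameters"; soft shadows and area lights are out of scope here.
import numpy as np

def project_shadow(vertices: np.ndarray, light_dir) -> np.ndarray:
    """vertices: (N, 3) points of the virtual object above the ground plane;
    light_dir: direction the light travels, must have a negative Y component.
    Returns the (N, 3) shadow footprint of the vertices on Y = 0."""
    d = np.asarray(light_dir, dtype=float)
    if d[1] >= 0:
        raise ValueError("light must point towards the ground plane")
    t = -vertices[:, 1] / d[1]                 # per-vertex travel distance to Y = 0
    return vertices + t[:, None] * d

if __name__ == "__main__":
    cube_top = np.array([[0.0, 1.0, 5.0], [1.0, 1.0, 5.0]])
    print(project_shadow(cube_top, light_dir=[0.5, -1.0, 0.2]))
```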
PCT/CA2022/051479 2021-10-06 2022-10-06 Systems and methods for compositing a virtual object in a digital image WO2023056559A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163252866P 2021-10-06 2021-10-06
US63/252,866 2021-10-06

Publications (1)

Publication Number Publication Date
WO2023056559A1 true WO2023056559A1 (en) 2023-04-13

Family

ID=85803790

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA2022/051479 WO2023056559A1 (en) 2021-10-06 2022-10-06 Systems and methods for compositing a virtual object in a digital image

Country Status (1)

Country Link
WO (1) WO2023056559A1 (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090315978A1 (en) * 2006-06-02 2009-12-24 Eidgenossische Technische Hochschule Zurich Method and system for generating a 3d representation of a dynamically changing 3d scene
US20080136820A1 (en) * 2006-10-20 2008-06-12 Microsoft Corporation Progressive cut: interactive object segmentation
US10515460B2 (en) * 2017-11-29 2019-12-24 Adobe Inc. Neural network-based camera calibration
US10692277B1 (en) * 2019-03-21 2020-06-23 Adobe Inc. Dynamically estimating lighting parameters for positions within augmented-reality scenes using a neural network
US11004208B2 (en) * 2019-03-26 2021-05-11 Adobe Inc. Interactive image matting using neural networks
US10665011B1 (en) * 2019-05-31 2020-05-26 Adobe Inc. Dynamically estimating lighting parameters for positions within augmented-reality scenes based on global and local features
WO2021042208A1 (en) * 2019-09-03 2021-03-11 UNIVERSITé LAVAL Dynamically estimating light-source-specific parameters for digital images using a neural network

Non-Patent Citations (11)

* Cited by examiner, † Cited by third party
Title
CHIA-HOW LIN ; SIN-YI JIANG ; YUEH-JU PU ; KAI-TAI SONG: "Robust ground plane detection for obstacle avoidance of mobile robots using a monocular camera", INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2010 IEEE/RSJ INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 18 October 2010 (2010-10-18), Piscataway, NJ, USA , pages 3706 - 3711, XP031920875, ISBN: 978-1-4244-6674-0, DOI: 10.1109/IROS.2010.5653055 *
HOLD-GEOFFROY, Y. ET AL.: "A Perceptual Measure for Deep Single Image Camera Calibration", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 23 June 2018 (2018-06-23), pages 2354 - 2363, XP033476201, Retrieved from the Internet <URL:https://doi.org/10.1109/CVPR.2018.00250> DOI: 10.1109/CVPR.2018.00250 *
MERCIER, J.P ET AL.: "Deep template-based object instance detection", PROCEEDINGS OF THE 2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV, 8 January 2021 (2021-01-08), pages 1506 - 1515, XP033926552, Retrieved from the Internet <URL:https://doi.org/10.1109/WACV48630.2021.00155> DOI: 10.1109/WACV48630.2021.00155 *
OKADA K., KAGAMI S., INABA M., INOUE H.: "Plane segment finder : algorithm, implementation and applications", PROCEEDINGS OF THE 2001 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION. ICRA 2001. SEOUL, KOREA, MAY 21 - 26, 2001., NEW YORK, NY : IEEE., US, vol. 2, 21 May 2001 (2001-05-21) - 26 May 2001 (2001-05-26), US , pages 2120 - 2125, XP010550457, ISBN: 978-0-7803-6576-6, DOI: 10.1109/ROBOT.2001.932920 *
SUVOROV ROMAN; LOGACHEVA ELIZAVETA; MASHIKHIN ANTON; REMIZOVA ANASTASIA; ASHUKHA ARSENII; SILVESTROV ALEKSEI; KONG NAEJIN; GOKA HA: "Resolution-robust Large Mask Inpainting with Fourier Convolutions", 2022 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), IEEE, 3 January 2022 (2022-01-03), pages 3172 - 3182, XP034086371, DOI: 10.1109/WACV51458.2022.00323 *
TARKO, J. ET AL.: "Real-time Virtual Object Insertion for Moving 360° Videos", PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON VIRTUAL-REALITY CONTINUUM AND ITS APPLICATIONS IN INDUSTRY (VRCAI '19, November 2019 (2019-11-01), pages 1 - 9, XP058941963, Retrieved from the Internet <URL:https://doi.org/10.1145/3359997.3365708> DOI: 10.1145/3359997.3365708 *
V. BALISAVIRA ; V. K. PANDEY: "Real-time Object Detection by Road Plane Segmentation Technique for ADAS", SIGNAL IMAGE TECHNOLOGY AND INTERNET BASED SYSTEMS (SITIS), 2012 EIGHTH INTERNATIONAL CONFERENCE ON, IEEE, 25 November 2012 (2012-11-25), pages 161 - 167, XP032348509, ISBN: 978-1-4673-5152-2, DOI: 10.1109/SITIS.2012.34 *
WANG, G. ; TSUI, H.T. ; HU, Z. ; WU, F.: "Camera calibration and 3D reconstruction from a single view based on scene constraints", IMAGE AND VISION COMPUTING, ELSEVIER, GUILDFORD, GB, vol. 23, no. 3, 1 March 2005 (2005-03-01), GUILDFORD, GB , pages 311 - 323, XP027617865, ISSN: 0262-8856 *
WILCZKOWIAK M., BOYER E., STURM P.: "Camera calibration and 3D reconstruction from single images using parallelepipeds", PROCEEDINGS OF THE EIGHT IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION. (ICCV). VANCOUVER, BRITISH COLUMBIA, CANADA, JULY 7 - 14, 2001., LOS ALAMITOS, CA : IEEE COMP. SOC., US, vol. 1, 7 July 2001 (2001-07-07) - 14 July 2001 (2001-07-14), US , pages 142 - 148, XP010553974, ISBN: 978-0-7695-1143-6 *
YANNICK HOLD-GEOFFROY; DOMINIQUE PICHÉ-MEUNIER; KALYAN SUNKAVALLI; JEAN-CHARLES BAZIN; FRANÇOIS RAMEAU; JEAN-FRANÇOIS LA: "A Deep Perceptual Measure for Lens and Camera Calibration", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 August 2022 (2022-08-25), 201 Olin Library Cornell University Ithaca, NY 14853, XP091302964 *
ZHENGYOU ZHANG: "Flexible camera calibration by viewing a plane from unknown orientations", COMPUTER VISION, 1999. THE PROCEEDINGS OF THE SEVENTH IEEE INTERNATION AL CONFERENCE ON KERKYRA, GREECE 20-27 SEPT. 1999, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 1 January 1999 (1999-01-01) - 27 September 1999 (1999-09-27), US , pages 666 - 673 vol.1, XP055690818, ISBN: 978-0-7695-0164-2, DOI: 10.1109/ICCV.1999.791289 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22877736

Country of ref document: EP

Kind code of ref document: A1