WO2021210995A1 - Method for automatic object removal from a photo, processing system and associated computer program product - Google Patents


Info

Publication number
WO2021210995A1
WO2021210995A1 PCT/PL2020/050029
Authority
WO
WIPO (PCT)
Prior art keywords
photo
tiles
consecutive
tile
cost function
Prior art date
Application number
PCT/PL2020/050029
Other languages
French (fr)
Inventor
Jakub ŁUKASZEWICZ
Rafał MUSZYŃSKI
Marcin CHALECKI
Paweł KUBIK
Filip SKURNIAK
Michal Kudelski
Bartosz BISKUPSKI
Original Assignee
Tcl Corporate Research (Europe) Sp. Z O.O.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tcl Corporate Research (Europe) Sp. Z O.O. filed Critical Tcl Corporate Research (Europe) Sp. Z O.O.
Priority to PCT/PL2020/050029 priority Critical patent/WO2021210995A1/en
Publication of WO2021210995A1 publication Critical patent/WO2021210995A1/en

Classifications

    • G06T5/77
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06T5/60
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20076Probabilistic image processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Definitions

  • the present invention relates to a method for automatic object removal from a photo, particularly for mobile devices provided with a camera.
  • the invention also relates to a processing system able to implement the method and a computer program product associated with the method.
  • Video inpainting refers to a field of computer vision that aims to remove objects or restore missing or tainted regions present in a video sequence by utilizing spatial and temporal information from neighboring scenes.
  • the overriding objective is to generate an inpainted area that is merged seamlessly into the video. In that way when the video is played as a sequence, the visual coherence is maintained throughout and no distortion in the affected area is observable to the human eye.
  • the invention is based on the object of developing a method for automatic object removal from a photo which would be suitable, firstly, for ensuring automatic removal of unwanted objects, such as crowds or cars, from a photo without additional manual operation by the user and, secondly, for being able to obtain a good-quality output photo.
  • This object is achieved by a method for automatic object removal from a photo, a processing system comprising an image processing pipeline according to the invention, and an associated computer program product.
  • the invention provides crowd removal as a new feature of a camera, or of any device with an embedded camera, that allows crowds to be removed automatically from a photo using deep neural networks and semantic methods (object detection, semantic segmentation, etc.).
  • a computer-implemented method for automatic object removal from a photo comprises providing a reference photo containing at least one object to be removed from the reference photo, providing at least one consecutive photo containing at least one object to be removed from the reference photo, performing alignment of each of at least one consecutive photo with the reference photo, for the reference photo and each at least one consecutive photo performing separately: a) determination of at least one object to be removed from the reference photo, b) division of the photo into tiles, c) calculation of the cost function for each tile, next, based on all tiles from all photos, searching for the best new combination of tiles by performing optimization of the global cost function, outputting a photo comprising the best combination of tiles for which the global function has the minimum value, said outputted photo being the reference photo with at least one object removed and replaced by background from at least one consecutive photo.
  • the method in accordance with the invention may also have one or more of the following features, separately or in combination:
  • - detection of at least one object to be removed from the reference photo is a semantic object detection using a neural network
    • each photo is divided into at least 2x2 tiles
  • the step of outputting a new photo comprises blending
  • the step of alignment comprises key points determination using a combined corner and edge detector
  • a processing system for automatic object removal from a photo comprises at least a memory and a processor which is configured to implement steps of the method according to the invention.
  • the invention also concerns a computer program product comprising instructions, which when executed by a processor, cause the processor to perform steps of the method according to the invention.
  • the proposed invention allows for automatic removal of crowds or other moving objects from a photo. Thanks to the acquisition of at least two photos, information over time regarding moving objects and the background can be gathered. Thanks to a specific cost function calculated for each predefined subregion of each consecutive image it is possible to evaluate the fit of each distinct subregion taken from other photos into the reference photo which undergoes editing.
  • the proposed invention allows photos to be edited so as to ignore regions which should not change, or whose change is not important in the context of a better photo.
  • FIG. 1 shows an input photo and an output photo of the method according to the invention
  • FIG. 2 shows schematically a process of acquisition of multiple photos and their processing by a method according to the invention
  • FIG. 3 shows key points found in each consecutive photo which are used for alignment of each consecutive photo with the reference photo
  • FIG. 4 shows exemplary objects in frames found by a semantic object detection trained model
  • FIG. 5 shows exemplary tiles with background to be used as replacement of chosen tiles in the reference photo
  • Fig. 6 shows another example of an input reference photo and an output photo with removed objects
  • Fig. 7 shows a flow chart of the method in accordance with one embodiment of the invention
  • the general idea of the invention is to allow users of a digital camera to take not only a single photo, as in traditional camera mode, but continuously capture consecutive frames for as long as required by the user in order to acquire enough information for further processing.
  • the invention is based on the assumption that people in the crowd do not stand still and often will leave the scene after a while.
  • the captured data is processed online to detect persons and objects using deep neural networks. This information allows unwanted people and objects to be removed from the reference photo.
  • Figure 1 shows, in general, the purpose and the results of the method of automatic object removal from a photo. It can be observed that the input image contained two unwanted walking persons in the scene.
  • the output image received at the end of the method according to the invention contains only the person whose picture was taken while the crowd is replaced by a background of good quality.
  • the proposed method for automatic object removal from a photo is composed of several steps.
  • the method begins with a step 100 of providing multiple photos.
  • This step 100 comprises two substeps.
  • the first substep 101 is a substep 101 of providing ‘a reference photo’ by means of a camera.
  • the ‘reference photo’ means here a first photo from a series of photos taken continuously by the user of the camera.
  • the reference photo is the one from which at least one object will be removed and replaced by a background from another photo.
  • the method according to the invention requires multiple photos to be taken. Apart from the reference photo, a set of images is taken in a sequence by the user of the camera from approximately the same perspective within a predetermined period.
  • the series of images can be images registered by a standalone digital camera or a digital camera embedded into another digital device like a mobile phone.
  • the images can be written into a separate memory before processing or can be captured directly from the camera.
  • the method comprises a substep 102 of providing at least one consecutive photo.
  • the substep 102 of providing at least one consecutive input image preferably comprises capturing consecutive photos and their parameters directly from the camera in an on-the-fly mode.
  • the reference photo and at least one consecutive photo can be provided by reading it from a memory, acquiring image parameters, in particular its size and resolution, and optionally displaying it on a display for user perception.
  • the method is performed on the fly; it is the user who decides when to stop acquiring consecutive photos. In practice the period of acquisition can last 5 to 20 seconds. In another embodiment the decision on how many consecutive images should be processed is taken automatically based on the amount of acquired information, namely the method stops automatically if the required amount of information has been gathered for removal of all detected objects.
  • Next step of the method of automated object removal from a photo is an image alignment step 200.
  • each incoming image is aligned onto the reference frame of the first image, namely the reference photo.
  • a ’reference frame’ means here an initial position of the reference photo.
  • a substep 201 of key points determination is performed.
  • a Shi-Tomasi feature detector, derived from the combined corner and edge detector described in Chris Harris and Mike Stephens, "A Combined Corner and Edge Detector", Alvey Vision Conference, 1988, is used for the purpose of finding key points. By using this approach key points can be found very quickly even on a mobile device. Key points are used to detect the relative position between consecutive input images. The result is shown in Fig. 3.
  • the image alignment step 200 comprises a matching substep 202. Within this substep, each input image from the set of consecutive images is matched using, for example, the RANSAC algorithm as described in M. A. Fischler and R. C. Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography", Comm. of the ACM, 1981. Knowing the position of these key points in each consecutive input image, each incoming image can be warped into the reference frame of the first image. Such an aligned image can be used in further processing.
  • Thanks to this alignment step 200, it is guaranteed that the content of each incoming input image can be objectively compared with the reference photo, namely the first image. This is required since, in practice, the hand of the user moves between the capture moments of two consecutive images, so the same object is captured with different coordinates on each photo. It will be obvious to the person skilled in the art that the alignment step 200 will cause some parts of the consecutive images to be cut off for further processing.
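The alignment idea above can be illustrated with a small, self-contained sketch. This is not the patent's implementation: for brevity it assumes the camera motion between two frames is a pure 2D translation (rather than the full warp recovered from key points), and it implements the RANSAC hypothesis-and-verify loop directly in NumPy on hypothetical example coordinates.

```python
import numpy as np

def ransac_translation(src, dst, n_iters=100, tol=2.0, seed=0):
    """Estimate a 2D translation from matched key points, robust to outliers.

    A simplified stand-in for the RANSAC fit described in the patent: each
    iteration hypothesises the shift from a single correspondence and keeps
    the hypothesis supported by the most inliers.
    """
    rng = np.random.default_rng(seed)
    best_shift, best_inliers = np.zeros(2), 0
    for _ in range(n_iters):
        i = rng.integers(len(src))
        shift = dst[i] - src[i]                       # hypothesis from one match
        err = np.linalg.norm(dst - (src + shift), axis=1)
        inliers = int((err < tol).sum())
        if inliers > best_inliers:
            best_inliers, best_shift = inliers, shift
    return best_shift

# Synthetic matches: true hand shake of (5, -3) pixels plus two gross
# outliers (key points that landed on a moving person).
src = np.array([[10, 10], [40, 25], [70, 60], [20, 80], [55, 15]], float)
dst = src + np.array([5.0, -3.0])
dst[3] = [0, 0]           # outlier
dst[4] = [99, 99]         # outlier

shift = ransac_translation(src, dst)
print(shift)  # close to [ 5. -3.]
```

In the full method this estimated motion would be replaced by a homography fitted to many key points, and the consecutive image warped into the reference frame before tile comparison.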
  • Next is a semantic object detection step 300, in which, for example, the known YOLO algorithm is used for object detection on the reference photo and each consecutive photo. In particular, in one embodiment, all persons are detected on the reference photo and each consecutive photo. In another embodiment other types of objects can be detected.
  • a deep neural network is used to process the reference image and the other consecutive photos so as to detect objects, in particular persons.
  • regular rectangular regions of the photo are outputted by the object detection algorithm implemented with a neural network.
  • a YOLOv3 neural network architecture, as described in Joseph Redmon and Ali Farhadi, "YOLOv3: An Incremental Improvement", 2018, can be trained on a custom dataset.
  • the data set comprises photos with labeled people.
  • the model can also be trained for other types of objects, based on a labeled data set.
  • a detected region is called ‘a region of interest’.
  • regions are shown in Fig. 4. All captured images and detected regions of interest are temporarily registered.
  • the information about which parts of a photo contain a detected object is stored in a specific data structure and later assigned to each corresponding image subregion extracted in the next step of the method according to the invention.
  • a tile means a 2D rectangular subregion of an image received by partitioning said image with the use of a net having a mesh of predetermined size.
  • a tile is also a 3D element resulting from the partition of a 3D structure, the 3D structure consisting of a set of consecutive images. The first two dimensions of said 3D structure are the physical dimensions of each consecutive image, while the third dimension is time. As will be explained later, a tile is a node in a 3D Markov Random Field.
  • the size of a tile in 2D results from the predetermined number of meshes into which an image should be split in 2D, while said number of tiles depends on the computation power of the hardware.
  • the decision on how to partition a photo into tiles can be taken automatically, for example based on the hardware capacity of a specific mobile phone type.
  • the size and number of tiles is predetermined.
  • Each photo is divided into at least 2x2 tiles. For example, photos are partitioned into 21x11 tiles. Examples of tiles are shown in Fig. 5. In the case of photos having 1920x1080 resolution, the size of a tile is (1920/21) x (1080/11).
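As a minimal illustration of the tiling step 400, the following Python sketch partitions a 1920x1080 photo into the 21x11 grid mentioned above. The function name and the cropping of borders that do not divide evenly are assumptions made for the example.

```python
import numpy as np

def split_into_tiles(img, n_cols=21, n_rows=11):
    """Partition an image into a regular grid of rectangular tiles.

    Mirrors the (1920 / 21) x (1080 / 11) example from the patent: every
    photo in the series is cut with the same net, so a tile at (row, col)
    exists at the same coordinates in every photo.
    """
    h, w = img.shape[:2]
    th, tw = h // n_rows, w // n_cols            # tile height / width
    tiles = {}
    for r in range(n_rows):
        for c in range(n_cols):
            tiles[(r, c)] = img[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
    return tiles

photo = np.zeros((1080, 1920, 3), dtype=np.uint8)
tiles = split_into_tiles(photo)
print(len(tiles), tiles[(0, 0)].shape)  # 231 tiles of shape (98, 91, 3)
```

Because every photo is split with the same net, the dictionary keys double as the tile coordinates used later when candidate tiles replace reference tiles.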
  • Each photo is partitioned using the same net and contains a predetermined number of rectangular subregions, namely each photo is divided into exactly the same number of tiles.
  • the person skilled in the art will understand that the smaller the size of a tile, the more detailed the processing that is possible. For example, more detailed processing is desired if two persons are close to each other in the photo. On the other hand, the processing cannot be too time-consuming; thus the number of tiles can be determined dynamically in each case separately.
  • the step 400 of dividing each photo into tiles results in a photo consisting of regular rectangular puzzle pieces.
  • Tiles of the same sizes and the same coordinates are present on the reference photo and the other consecutive photos.
  • Tiles from the consecutive photos are called candidate tiles.
  • each candidate tile can replace the tile of the same coordinates on the reference photo.
  • only chosen tiles from the reference photo should be replaced by a candidate tile from one of the consecutive photos. Examples of tiles are shown in Fig. 5.
  • each tile is assigned information about the presence of a detected object within its area.
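One possible way to attach that object information to tiles, sketched here under the assumption that the detector returns axis-aligned (x1, y1, x2, y2) bounding boxes, is to flag every tile of the grid that a detection box overlaps. The function and the example box coordinates are illustrative, not taken from the patent.

```python
def flag_tiles_with_objects(boxes, img_w, img_h, n_cols=21, n_rows=11):
    """Mark which grid tiles overlap any detected bounding box.

    `boxes` holds (x1, y1, x2, y2) detections from the object detector;
    the returned set contains (row, col) indices of tiles whose area
    intersects at least one box.
    """
    tw, th = img_w / n_cols, img_h / n_rows
    flagged = set()
    for x1, y1, x2, y2 in boxes:
        c0, c1 = int(x1 // tw), min(int(x2 // tw), n_cols - 1)
        r0, r1 = int(y1 // th), min(int(y2 // th), n_rows - 1)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                flagged.add((r, c))
    return flagged

# One person detected near the top-left of a 1920x1080 photo.
flags = flag_tiles_with_objects([(100, 200, 260, 500)], 1920, 1080)
print(sorted(flags))
```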
  • the output information from the trained neural network model which is further kept in a tile can be used also in checking whether the acquisition of photos can be early stopped. Namely, if for each tile in the reference photo, for which an object to be removed has been detected, there is at least one candidate tile in at least one consecutive photo which does not contain said detected object, then it means that enough information has been acquired for processing.
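The early-stop criterion just described can be expressed compactly. This is a hedged sketch: `ref_flags` and `candidate_flags_per_photo` are assumed data structures holding, per photo, the set of tile coordinates in which an object to be removed was detected.

```python
def enough_information(ref_flags, candidate_flags_per_photo):
    """Early-stop test: acquisition may end once every object tile in the
    reference photo has, in at least one consecutive photo, a candidate
    tile at the same coordinates that is object-free.
    """
    return all(
        any(pos not in flags for flags in candidate_flags_per_photo)
        for pos in ref_flags
    )

ref = {(2, 1), (5, 4)}                     # tiles with people in the reference
photos = [{(2, 1), (5, 4)},                # people still in both places
          {(2, 1)},                        # tile (5, 4) is now clear
          {(5, 4)}]                        # tile (2, 1) is now clear
ok = enough_information(ref, photos)
print(ok)  # True: every object tile has at least one clear candidate
```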
  • After dividing each photo, including the reference photo, into tiles, the method according to the invention passes to a tile cost function calculation step 500.
  • Each tile of each image is considered as a separate node of a Markov Random Field (MRF).
  • this step results in receiving a set of specific values C(x_p) for each tile, each value representing the matching degree of the tile x_p into a specific position p of the reference photo.
  • Such a set of values is also calculated for tiles in the reference photo.
  • the equation defines the cost C(x_p) of assigning candidate tile x_p to position p.
  • the cost comprises, among others, a term Cost_O indicating whether tile x_p contains an object.
  • the calculated cost function takes into consideration, in particular, whether a tile includes a specific object detected in the semantic detection step 300.
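Since the full cost expression is not reproduced in this extract, the following Python sketch only illustrates the structure of such a per-tile cost: the object-presence term (the patent's Cost_O) plus an assumed photometric term comparing the candidate with the reference tile. The weights and the photometric term are arbitrary illustrative choices, not the patent's formula.

```python
import numpy as np

# Assumed weights for illustration; the patent gives no numeric values.
W_OBJECT, W_PHOTO = 10.0, 1.0

def tile_cost(candidate, reference, contains_object):
    """Sketch of the per-tile cost C(x_p) for placing candidate tile x_p
    at position p. Cost_O penalises tiles containing a detected object;
    the photometric term (mean absolute colour difference, normalised)
    rewards candidates that match the reference background.
    """
    cost_o = W_OBJECT if contains_object else 0.0
    diff = np.abs(candidate.astype(np.float32) - reference.astype(np.float32))
    cost_photo = W_PHOTO * float(np.mean(diff)) / 255.0
    return cost_o + cost_photo

ref_tile = np.full((98, 91, 3), 120, dtype=np.uint8)     # background
cand_clear = np.full((98, 91, 3), 122, dtype=np.uint8)   # same scene, no object
cand_person = np.full((98, 91, 3), 40, dtype=np.uint8)   # darker: person present

c_clear = tile_cost(cand_clear, ref_tile, contains_object=False)
c_person = tile_cost(cand_person, ref_tile, contains_object=True)
print(round(c_clear, 3), round(c_person, 3))  # the object-free tile is far cheaper
```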
  • the next step of the method according to the invention is a step 600 of global cost optimization, which enables choosing the optimal combination of tiles forming a new image from all captured images. In practice this step is iterative and enables choosing the combination of tiles which gives the smallest possible cost. By iteratively optimizing the global cost function, a decision is taken for each tile in the reference photo whether it should be replaced by one of the tiles having the same coordinates from one of the consecutive photos.
  • any algorithm that is typically used for optimizing MRFs, e.g. Loopy Belief Propagation, can be utilized.
  • the choice of the algorithm determines the trade-off between computation speed and the possibility of convergence to the optimal configuration.
  • the following greedy optimization algorithm can be used.
  • the calculation is initialized with tiles taken from the reference frame (of the reference photo). Then each tile is checked by iterating in random order. During the check, all available versions of a specific tile (namely all other tiles having the same coordinates within the net) are compared with the tile from the reference photo and their costs are calculated. The tile is replaced if any of the available versions has a lower cost. After a replacement, the neighborhood of the replaced tile is recomputed: new values of the cost function of all four neighboring tiles of the replaced tile are calculated.
  • the cost function for the four neighboring candidate tiles of the tile that was replaced in the previous iteration is calculated. This is because the cost function for a specific tile takes into account a factor resulting from neighboring tiles. It means that the cost function becomes out of date for a candidate tile once one of its neighboring tiles in the reference photo has been replaced.
  • the value of the global cost function can be determined correctly in the consecutive iteration only for updated input data. If the calculated value of the global cost function is lower, then again the replacement takes place, and the iteration either stops or the calculation is performed for another randomly chosen tile in the reference photo.
  • the optimization algorithm stops if the predetermined number of iterations has been performed or there are no more tiles in the reference photo for which the replacement would result in receiving a lower value of the global cost function (no more replacement takes place).
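The greedy procedure above can be sketched as follows. The pairwise term (a fixed penalty whenever two adjacent tiles come from different photos) and the sweep-based stopping rule are simplifying assumptions; the patent only states that neighboring tiles influence the cost and that iteration stops when no replacement lowers the global cost.

```python
import random

def greedy_mrf(unary, n_rows, n_cols, w_pair=0.5, max_sweeps=10, seed=0):
    """Greedy optimisation over the tile MRF, a simple alternative to
    loopy belief propagation. `unary[k][(r, c)]` is the cost of taking
    the tile at (r, c) from photo k, with k = 0 being the reference.
    """
    rng = random.Random(seed)
    n_photos = len(unary)
    # Initialise with the tiles of the reference photo.
    assign = {(r, c): 0 for r in range(n_rows) for c in range(n_cols)}

    def local_cost(pos, k):
        r, c = pos
        cost = unary[k][pos]
        for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if nb in assign and assign[nb] != k:
                cost += w_pair          # assumed seam penalty between photos
        return cost

    for _ in range(max_sweeps):
        positions = list(assign)
        rng.shuffle(positions)          # visit tiles in random order
        changed = False
        for pos in positions:
            best = min(range(n_photos), key=lambda k: local_cost(pos, k))
            if best != assign[pos]:
                assign[pos] = best      # neighbours see the update on
                changed = True          # their next visit
        if not changed:                 # no replacement lowers the cost
            break
    return assign

# Toy 1x3 grid, 2 photos: the middle reference tile contains a person
# (high cost); the consecutive photo offers a cheap object-free tile there.
unary = [{(0, 0): 0.0, (0, 1): 10.0, (0, 2): 0.0},   # reference photo
         {(0, 0): 0.6, (0, 1): 0.1,  (0, 2): 0.6}]   # consecutive photo
choice = greedy_mrf(unary, 1, 3)
print(choice)  # only the middle tile is taken from photo 1
```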
  • restarting of the algorithm multiple times with random initialization can be considered.
  • the number of restarts should scale with the amount of computation that can be performed according to the specification of the final device in which the method will be implemented.
  • The last step is a step 700 of blending. This processing is performed in order to output a final photo having at least one object removed and replaced by background.
  • once a region (tile) to be reconstructed has been calculated in a specific iteration of the optimization step, it can be blended into the reference photo, for example if on-the-fly displaying is required.
  • blending is performed after the end of the optimization step. Blending can be done by any suitable algorithm, for example using Poisson blending as described in Patrick Pérez, Michel Gangnet and Andrew Blake, "Poisson Image Editing", 2003. The purpose of this step is to receive a photo of very good quality.
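Poisson blending itself is beyond a short sketch, so the example below substitutes a simpler feathered (linear-alpha) blend to show where blending sits in the pipeline; the function and its `margin` parameter are assumptions for illustration, not the patent's algorithm.

```python
import numpy as np

def feather_blend(reference, patch, top, left, margin=8):
    """Blend a replacement tile into the reference photo with a feathered
    (linear-alpha) seam: the patch fades in over `margin` pixels at its
    borders instead of creating a hard edge. A deliberately simpler
    stand-in for the Poisson blending used in the patent.
    """
    h, w = patch.shape[:2]
    # Distance of each pixel to the nearest patch border, per axis.
    ys = np.minimum(np.arange(h), np.arange(h)[::-1])
    xs = np.minimum(np.arange(w), np.arange(w)[::-1])
    # Weight ramps from near 0 at the border up to 1 `margin` pixels inside.
    alpha = np.minimum(1.0, (np.minimum.outer(ys, xs) + 1.0) / margin)
    alpha = alpha[..., None]                        # broadcast over channels
    out = reference.astype(np.float32).copy()
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * patch + (1 - alpha) * region
    return out.astype(np.uint8)

ref = np.full((100, 100, 3), 200, dtype=np.uint8)   # plain background
tile = np.full((40, 40, 3), 50, dtype=np.uint8)     # replacement tile
blended = feather_blend(ref, tile, top=30, left=30)
print(blended[50, 50], blended[0, 0])  # patch centre replaced, corner untouched
```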
  • the Poisson blending algorithm alleviates color differences and reduces the number of artifacts.
  • the regions (tiles) are blended onto the reference image to remove the crowd and unwanted objects and to produce the final result.
  • the processing system comprises a memory and a processor which is configured (by comprising several software modules) to implement steps of the method as described above.
  • the processing system is embedded into a mobile phone.
  • the mobile phone according to the invention comprises display means (not shown) for displaying the output of the software modules. From the user's perspective the processing system allows him to perform the following actions: the user starts object or crowd removal and a first photo is taken; the user continues object or crowd removal to collect data, and more photos are collected; the user stops object or crowd removal and the final result is computed.
  • the processing system uses collected data to remove objects, for example people, from the first photo.
  • the processing system can comprise a module for supporting steady holding of the camera. Said module outputs information on the position of the camera to be displayed by the display means. It increases the chances of acquiring consecutive photos taken from the same perspective.
  • display means are configured to show a reference photo with frames surrounding detected objects and a camera position indicator. The user can see when enough information has been acquired to remove a specific object, for example based on the frame colour.
  • the method stops automatically if the required amount of information has been gathered for removal of all detected objects.
  • aspects of the present invention can be realized by a computer of a system or apparatus (or devices such as a CPU or MPU) that reads out and executes a computer program product recorded on a memory device to perform the functions of the above-described embodiment(s), and by a method, the steps of which are performed by a computer of a system or apparatus by, for example, reading out and executing a program recorded on a memory device to perform the functions of the above-described embodiment(s).
  • the claimed computer program product is provided to the computer for example via a network or from a recording medium of various types serving as the memory device.
  • the computer program product according to the invention comprises also a non-transitory machine-readable medium.

Abstract

The invention provides crowd removal as a new feature of a camera, or of any device with an embedded camera, that allows crowds or other moving objects to be removed automatically from a photo. The computer-implemented method for automatic object removal from a photo comprises providing a reference photo containing at least one object to be removed from the reference photo, providing at least one consecutive photo containing at least one object to be removed from the reference photo, performing alignment of each of at least one consecutive photo with the reference photo, for the reference photo and each at least one consecutive photo performing separately: a) determination of at least one object to be removed from the reference photo, b) division of the photo into tiles, c) calculation of the cost function for each tile, next, based on all tiles from all photos, searching for the best new combination of tiles by performing optimization of the global cost function, outputting a photo comprising the best combination of tiles for which the global function has the minimum value, said outputted photo being the reference photo with at least one object removed and replaced by background from at least one consecutive photo.

Description

Method for automatic object removal from a photo, processing system and associated computer program product.
[0001] The present invention relates to a method for automatic object removal from a photo, particularly for mobile devices provided with a camera. The invention also relates to a processing system able to implement the method and a computer program product associated with the method.
BACKGROUND
[0002] Removing crowds and unwanted objects to improve the quality of a photo is one of the situations in which photo editing or retouching is highly desirable. Many photos are taken in places where people are present all the time. The result is that unknown persons, or persons occluding some important parts of the scene, are present in photos, which makes the photos less attractive for the author. Crowded places are, for example, tourist spots, streets and museums. In such places it is often very hard to take a picture without unwanted persons in the background.
[0003] For this purpose, it is known in the prior art to edit a single scene and retouch a single photo by inpainting. It is also known in the prior art to edit and retouch multiple scenes by inpainting. Video inpainting refers to a field of computer vision that aims to remove objects or restore missing or tainted regions present in a video sequence by utilizing spatial and temporal information from neighboring scenes. The overriding objective is to generate an inpainted area that is merged seamlessly into the video. In that way when the video is played as a sequence, the visual coherence is maintained throughout and no distortion in the affected area is observable to the human eye.
[0004] For single photo editing there are also other known solutions, usually used by professional graphic designers, which provide high quality tools (e.g. Photoshop) but require a lot of manual work and high-level expertise. There are also apps made for the mass market (e.g. TouchRetouch) which are much simpler, but they offer significantly lower quality and also require manual selection of the objects that need to be removed.
[0005] In general, all known solutions for object or crowd removal on a single photo require manual operation from the user.
[0006] Thus, there is still a need to provide a good-quality, automated method for crowd removal on a single photo.
DISCLOSURE OF THE INVENTION
[0007] Starting from the depicted prior art, the invention is based on the object of developing a method for automatic object removal from a photo which would be suitable, firstly, for ensuring automatic removal of unwanted objects, such as crowds or cars, from a photo without additional manual operation by the user and, secondly, for being able to obtain a good-quality output photo.
[0008] This object is achieved by a method for automatic object removal from a photo, a processing system comprising an image processing pipeline according to the invention, and an associated computer program product.
[0009] The invention provides crowd removal as a new feature of a camera, or of any device with an embedded camera, that allows crowds to be removed automatically from a photo using deep neural networks and semantic methods (object detection, semantic segmentation, etc.).
[0010] According to a first aspect of the invention a computer-implemented method for automatic object removal from a photo is provided. The method according to the invention comprises providing a reference photo containing at least one object to be removed from the reference photo, providing at least one consecutive photo containing at least one object to be removed from the reference photo, performing alignment of each of at least one consecutive photo with the reference photo, for the reference photo and each at least one consecutive photo performing separately: a) determination of at least one object to be removed from the reference photo, b) division of the photo into tiles, c) calculation of the cost function for each tile, next, based on all tiles from all photos, searching for the best new combination of tiles by performing optimization of the global cost function, outputting a photo comprising the best combination of tiles for which the global function has the minimum value, said outputted photo being the reference photo with at least one object removed and replaced by background from at least one consecutive photo.
[0011] Advantageous developments of the method for automatic object removal from a photo according to the invention are specified in the dependent claims.
[0012] The method in accordance with the invention may also have one or more of the following features, separately or in combination:
- steps from a) to c) are performed on the fly after providing each consecutive photo
- detection of at least one object to be removed from the reference photo is a semantic object detection using a neural network
- each photo is divided into at least 2x2 tiles
- in the optimization step a loopy belief propagation algorithm is used
- the step of outputting a new photo comprises blending
- the step of providing at least one consecutive photo ends automatically
- the step of alignment comprises key points determination using a combined corner and edge detector
- in each consecutive iteration of the global cost function optimization step a new value of the cost function is calculated for all candidate tiles that are neighbors of the tile replaced in the reference photo in the previous iteration
[0013] According to a second aspect a processing system for automatic object removal from a photo is provided. The system comprises at least a memory and a processor which is configured to implement steps of the method according to the invention.
[0014] The invention also concerns a computer program product comprising instructions, which when executed by a processor, cause the processor to perform steps of the method according to the invention.
[0013] The proposed invention allows for automatic removal of crowds or other moving objects from a photo. Thanks to the acquisition of at least two photos, information over time regarding moving objects and the background can be gathered.
[0014] Thanks to a specific cost function calculated for each predefined subregion of each consecutive image it is possible to evaluate the fit of each distinct subregion taken from other photos into the reference photo which undergoes editing.
[0015] Thanks to the context-based global optimization of the cost function, the proposed invention allows photos to be edited so as to ignore regions which should not change, or whose change is not important in the context of a better photo.
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] Other advantages and features will become apparent on reading the description of the invention and from the appended drawings, in which:
FIG. 1 shows an input photo and an output photo of the method according to the invention; FIG. 2 shows schematically a process of acquisition of multiple photos and their processing by a method according to the invention;
FIG. 3 shows key points found in each consecutive photo which are used for alignment of each consecutive photo with the reference photo;
FIG. 4 shows exemplary objects in frames found by a semantic object detection trained model; FIG. 5 shows exemplary tiles with background to be used as replacement of chosen tiles in the reference photo;
Fig. 6 shows another example of an input reference photo and an output photo with removed objects; Fig. 7 shows a flow chart of the method in accordance with one embodiment of the invention;
DETAILED DESCRIPTION
[0017] Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
[0018] The general idea of the invention is to allow users of a digital camera to take not only a single photo, as in traditional camera mode, but continuously capture consecutive frames for as long as required by the user in order to acquire enough information for further processing.
[0019] The invention is based on the assumption that people in the crowd do not stand still and often will leave the scene after a while.
[0020] The captured data is processed online to detect persons and objects using deep neural networks. This information allows to remove unwanted people and objects from the reference photo.
[0021] Figure 1 shows, in general, the purpose and the results of the method of automatic object removal from a photo. It can be observed that the input image contained two unwanted walking persons in the scene. The output image received at the end of the method according to the invention contains only the person whose picture was taken, while the crowd is replaced by a background of good quality.
[0022] Now the method for automatic crowd removal from a photo according to the invention will be described with reference to Figure 7, which shows the associated flow chart of steps. The proposed method for automatic object removal from a photo is composed of several steps. The method begins with a step 100 of providing multiple photos. This step 100 comprises two substeps. [0023] The first substep 101 is a substep of providing 'a reference photo' by means of a camera. The 'reference photo' means here the first photo from a series of photos taken continuously by the user of the camera. The reference photo is the one from which at least one object will be removed and replaced by a background from another photo.
[0024] The method according to the invention requires multiple photos to be taken. Apart from the reference photo, a set of images is taken in a sequence by the user of the camera from approximately the same perspective within a predetermined period. The series of images can be images registered by a standalone digital camera or by a digital camera embedded into another digital device, such as a mobile phone. The images can be written into a separate memory before processing or can be captured directly from the camera.
[0025] Thus, the method comprises a substep 102 of providing at least one consecutive photo. The substep 102 of providing at least one consecutive input image preferably comprises capturing consecutive photos and their parameters directly from the camera in an on-the-fly mode. Alternatively, the reference photo and at least one consecutive photo can be provided by reading them from a memory, acquiring image parameters, in particular size and resolution, and optionally displaying them on a display for user perception.
[0026] In one embodiment, if the method is performed on the fly, it is the user who decides when to stop acquiring consecutive photos. In practice the period of acquisition can last 5 to 20 seconds. In another embodiment the decision how many consecutive images should be processed is taken automatically based on the amount of acquired information, namely the method stops automatically once the required amount of information has been gathered for removal of all detected objects.
[0027] Several steps used for processing of each acquired photo will be described now below. The person skilled in the art will understand that those steps can be performed on the fly for each photo as it arrives, or can be performed in another technically reasonable sequence once all photos are registered.
[0028] The next step of the method of automated object removal from a photo is an image alignment step 200. In this alignment step each incoming image is aligned onto the reference frame of the first image, namely the reference photo. A 'reference frame' means here the initial position of the reference photo.
[0029] First, a substep 201 of key point determination is performed. For example, a Shi-Tomasi feature detector, as described in Chris Harris and Mike Stephens, "A Combined Corner and Edge Detector", Alvey Vision Conference, 1988, is used for the purpose of finding key points. By using this approach key points can be found very quickly, even on a mobile device. Key points are used to detect the relative position between consecutive input images. The result is shown in Fig. 3. [0030] Next, the image alignment step 200 comprises a matching substep 202. Within this substep, each input image from the set of consecutive images is matched using, for example, the RANSAC algorithm as described in M. A. Fischler and R. C. Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography", Comm. of the ACM, 1981. Knowing the position of those key points in each consecutive input image, each incoming image can be warped into the reference frame of the first image. Such an aligned image can be used in further processing. Performing this alignment step 200 guarantees that the content of each incoming input image can be objectively compared with the reference photo, namely the first image. This is required since, in practice, the hand of the user moves between the capture moments of two consecutive images, and the same object is captured with different coordinates on each photo. It will be obvious to the person skilled in the art that the alignment step 200 will cause cutting some parts of the consecutive images for further processing.
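By way of a non-limiting illustration, the key-point matching of substeps 201 and 202 can be sketched as follows. A pure translation model stands in for the full homography estimation and warping described above, so that the RANSAC hypothesis-and-vote idea remains visible; all function names and numeric values are illustrative only and do not form part of the claimed method.

```python
import numpy as np

def estimate_translation(ref_pts, new_pts, tol=2.0, iters=50, seed=0):
    """RANSAC-style estimation of a translation aligning new_pts onto ref_pts.

    Simplified stand-in for the matching of substep 202: each hypothesis is
    the shift implied by one random correspondence, and the shift supported
    by the most inlier key points wins.
    """
    rng = np.random.default_rng(seed)
    best_shift, best_inliers = np.zeros(2), -1
    for _ in range(iters):
        i = rng.integers(len(ref_pts))
        shift = ref_pts[i] - new_pts[i]                  # hypothesis from one match
        err = np.linalg.norm(new_pts + shift - ref_pts, axis=1)
        inliers = int((err < tol).sum())
        if inliers > best_inliers:
            best_shift, best_inliers = shift, inliers
    return best_shift

# Key points of the reference photo, and the same points as seen in a
# consecutive photo shifted by (5, -3) pixels, plus one outlier (a moving person).
ref = np.array([[10., 10.], [40., 80.], [120., 30.], [200., 150.]])
new = ref - np.array([5., -3.])
new[3] += np.array([40., 25.])                           # outlier: object that moved
shift = estimate_translation(ref, new)
print(shift)   # -> [ 5. -3.]
```

The hypothesis-and-vote loop mirrors the RANSAC principle: a model proposed from a minimal sample is kept only if enough key-point correspondences agree with it, so correspondences on moving objects are rejected as outliers.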
[0031] The next step of the method of automated object removal from a photo is a semantic object detection step 300 in which, for example, the known Yolo algorithm is used for object detection on the reference photo and each consecutive photo. In particular, in one embodiment, all persons are detected on the reference photo and each consecutive photo. In another embodiment, other types of objects can be detected. In the step 300 of semantic object detection in a photo, a deep neural network is used to process the reference image and the other consecutive photos so as to detect objects, in particular persons. In this step, regular rectangular regions of the photo are outputted by the object detection algorithm implemented with the neural network. For example, for this purpose a Yolo v3 neural network architecture, as described in Joseph Redmon and Ali Farhadi, "YOLOv3: An Incremental Improvement", 2018, can be trained on a custom dataset. The dataset comprises photos with labeled people. Of course, the model can be trained for other types of objects, also based on a labeled dataset. In this description, such a detected region is called 'a region of interest'. Such regions are shown in Fig. 4. All captured images and detected regions of interest are temporarily registered. The information about which parts of a photo contain a detected object is stored in a specific data structure and later on assigned to each corresponding image subregion extracted in the next step of the method according to the invention.
[0032] For each incoming image (including the reference photo) the method according to the invention passes to a step 400 of dividing the image into tiles. Here 'a tile' means a 2D rectangular subregion of an image received by partitioning said image with the use of a net having a mesh of predetermined size. At the same time, a tile can be seen as a 3D element resulting from the partition of a 3D structure, the 3D structure consisting of the set of consecutive images. The first two dimensions of said 3D structure are the physical dimensions of each consecutive image, while the third dimension is time. As will be explained later, a tile is a node in a 3D Markov Random Field.
[0033] In general, the size of a tile in 2D results from a predetermined number of meshes into which an image should be split in 2D, while said number of tiles depends on the computational power of the hardware. In one embodiment the decision how to partition a photo into tiles can be taken automatically, for example based on the hardware capacity of a specific mobile phone type. Preferably, the size and number of tiles are predetermined. Each photo is divided into at least 2x2 tiles. For example, photos are partitioned into 21x11 tiles. Examples of tiles are shown in Fig. 5. In the case of photos having 1920 x 1080 resolution, the size of a tile is (1920 / 21) x (1080 / 11).
[0034] Each photo is partitioned using the same net and contains a predetermined number of rectangular subregions, namely each photo is divided into exactly the same number of tiles. The person skilled in the art will understand that the smaller the size of a tile, the more detailed the processing that is possible. For example, more detailed processing is desired if two persons are close to each other on the photo. On the other hand, the processing cannot be time-consuming, thus the number of tiles can be determined dynamically in each case separately.
[0035] In consequence, the step 400 of dividing each photo into tiles results in a photo consisting of regular rectangular puzzle pieces. Tiles of the same sizes and the same coordinates are present on the reference photo and the other consecutive photos. Tiles from the consecutive photos are called candidate tiles. Theoretically, each candidate tile can replace the tile of the same coordinates on the reference photo. However, only chosen tiles from the reference photo should be replaced by a candidate tile from one of the consecutive photos. Examples of tiles are shown in Fig. 5. Moreover, each tile is assigned information about the presence of a detected object within its area. As a consequence, the output information from the trained neural network model, which is further kept in a tile, can also be used in checking whether the acquisition of photos can be stopped early. Namely, if for each tile in the reference photo for which an object to be removed has been detected there is at least one candidate tile in at least one consecutive photo which does not contain said detected object, then enough information has been acquired for processing.
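By way of a non-limiting illustration, the tile division of step 400 and the per-tile object flag can be sketched as follows, assuming the 21x11 grid of the example above and axis-aligned detection boxes in pixel coordinates; all names and the example box are illustrative only.

```python
import numpy as np

def tile_grid(width, height, nx=21, ny=11):
    """Split a width x height photo into an nx x ny grid of rectangular tiles.

    Returns (x0, y0, x1, y1) for every tile; 21 x 11 matches the example in
    the description, so 1920x1080 gives tiles of roughly 91 x 98 pixels.
    """
    xs = np.linspace(0, width, nx + 1, dtype=int)
    ys = np.linspace(0, height, ny + 1, dtype=int)
    return [(xs[i], ys[j], xs[i + 1], ys[j + 1])
            for j in range(ny) for i in range(nx)]

def tile_contains_object(tile, boxes):
    """True if any detected bounding box overlaps the tile's area."""
    x0, y0, x1, y1 = tile
    return any(not (bx1 <= x0 or bx0 >= x1 or by1 <= y0 or by0 >= y1)
               for bx0, by0, bx1, by1 in boxes)

tiles = tile_grid(1920, 1080)
boxes = [(300, 200, 420, 600)]        # one detected person (hypothetical box)
flags = [tile_contains_object(t, boxes) for t in tiles]
print(len(tiles), sum(flags))         # -> 231 10
```

The list of flags is the per-tile object information described above: the reference photo is ready for processing once, for every flagged reference tile, at least one consecutive photo offers an unflagged candidate tile at the same grid position.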
[0036] Once the image is split into tiles, based on the neural network output, each tile is assigned information about the presence or lack of a detected object within its area. This information will further allow deciding which tiles should be replaced.
[0037] The answer to the question which tile should be replaced, and whether it should be replaced at all, is received by performing the further steps of the method.
[0038] For each photo, including the reference photo, the method according to the invention passes to a tile cost function calculation step 500. Each tile of each image is considered a separate node of the Markov Random Field (MRF). In practice, this step results in receiving a set of specific values C(xp) for each tile, each value representing the matching degree of the tile at a specific position of the reference photo. Such sets of values are also calculated for tiles in the reference photo.
[0039] The cost function of assigning a candidate tile can be represented by the equation:

C(xp) = O(xp) + Σq Vpq(xp, xq)

[0040]
The equation defines the cost of assigning a candidate tile xp to position p.
The cost comprises:
• a cost O(xp) indicating whether tile xp contains an object;
• a sum of costs Vpq computed as the sum of squared differences between pixels of tile xp and adjacent pixels of a neighbouring tile xq, the sum being computed over all neighbour positions q for a given position p.
[0041] The calculated cost function takes into consideration the following parameters: whether a tile includes a specific object detected in the semantic detection step 300, and how well a tile fits its neighboring tiles (it favors rectangles which fit each other). [0042] The Applicant found that in the on-the-fly mode the processing of one incoming photo (including steps 100, 200, 300, 400, 500) takes about 500 ms, which is satisfactory for commercial use. [0043] The next step of the method according to the invention is a step 600 of global cost optimization, which enables choosing the optimal combination of tiles forming a new image from all captured images. In practice this step is iterative and enables choosing the combination of tiles which gives the smallest possible cost. By iteratively optimizing the global cost function, a decision is taken for each tile in the reference photo whether it should be replaced by one of the tiles having the same coordinates from one of the consecutive photos.
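By way of a non-limiting illustration, the per-tile cost C(xp) = O(xp) + Σq Vpq(xp, xq) of step 500 can be sketched as follows. The magnitude of the object penalty is an assumption, since the description only states that O(xp) indicates the presence of an object; the border sum of squared differences follows the definition of Vpq given above.

```python
import numpy as np

OBJECT_PENALTY = 1e6   # assumed weight; the description does not fix the value of O(xp)

def tile_cost(candidate, neighbors, has_object):
    """Cost C(xp) of placing `candidate` (an H x W x 3 array) at position p.

    O(xp): a large constant if the candidate tile still contains a detected object.
    Vpq:   sum of squared differences between the candidate's border pixels and
           the facing border pixels of each currently placed neighbor tile.
    """
    cost = OBJECT_PENALTY if has_object else 0.0
    edges = {                         # candidate edge vs. the facing neighbor edge
        "left":   (candidate[:, 0],  lambda n: n[:, -1]),
        "right":  (candidate[:, -1], lambda n: n[:, 0]),
        "top":    (candidate[0, :],  lambda n: n[-1, :]),
        "bottom": (candidate[-1, :], lambda n: n[0, :]),
    }
    for side, neighbor in neighbors.items():
        own, facing = edges[side]
        cost += float(((own - facing(neighbor)) ** 2).sum())
    return cost

flat = np.zeros((4, 4, 3))
print(tile_cost(flat, {"left": flat.copy()}, has_object=False))  # -> 0.0 (perfect seam)
```

A candidate whose borders continue the neighboring tiles seamlessly gets a low cost, while a tile still containing a detected person is dominated by the object penalty, which is exactly what drives the replacement decision in the optimization step.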
[0044] If the tiles' combination is optimal then: moving objects and people are replaced with background; there are no sharp transitions between tiles; there are no 'ghosts'; and there are no incomplete objects, such as hands being cut off or suitcases 'in the air'.
[0045] In the global cost optimization step 600, any algorithm typically used for optimizing MRFs, e.g. Loopy Belief Propagation, can be utilized. The choice of the algorithm determines the trade-off between the computation speed and the possibility of convergence to the optimal configuration. However, in one embodiment the following greedy optimization algorithm can be used.
[0046] The calculation is initialized with the tiles taken from the reference frame (of the reference photo). Then each tile is checked by iterating in random order. During the check, all available versions of a specific tile (namely all other tiles having the same coordinates within the net) are compared with the tile from the reference photo and their costs are calculated. The tile is replaced by one of its available versions if that version has a lower cost. After the replacement, the neighborhood of the replaced tile is recomputed, meaning that new values of the cost function of all four neighboring tiles of the already replaced tile are calculated.
[0047] In other words, in the first iteration of the optimization algorithm the global cost function value for all tiles from the reference photo is known. Moreover, a single cost function value is known for each candidate tile having a position which corresponds to a considered tile in the reference image. During one iteration, theoretically, each single tile is tentatively replaced by the corresponding candidate tile from each of the consecutive photos. This process is random in terms of the tile position from the reference photo for which the calculation is performed. However, said one iteration ends once a replacing tile is found for which the global cost function value is lower. Then another iteration starts; early stopping of the optimization step is not possible at this point. [0048] In each consecutive iteration all calculations regarding the global cost function are performed in the same way. However, in each consecutive iteration, first the cost function for the four neighboring candidate tiles of the tile that has been replaced in the previous iteration is calculated. This is because the cost function for a specific tile takes into account a factor resulting from the neighboring tiles. It means that the cost function becomes out of date for a candidate tile once one of its neighboring tiles in the reference photo has been replaced. The value of the global cost function can therefore be determined correctly in the consecutive iteration only for updated input data. If the calculated value of the global cost function is lower, then again the replacement takes place and the iteration stops; otherwise the calculation is performed randomly for another tile in the reference photo. [0049] The optimization algorithm stops when the predetermined number of iterations has been performed or when there are no more tiles in the reference photo for which a replacement would result in a lower value of the global cost function (no more replacements take place).
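By way of a non-limiting illustration, the randomized greedy optimization described in paragraphs [0046] to [0049] can be sketched as follows. To keep the sketch short, the candidate costs are held fixed; in the full method the cost of a tile depends on its current neighbors, so the neighboring costs would be recomputed after every replacement, as described above. All names are illustrative only.

```python
import numpy as np

def greedy_optimize(costs, rng=np.random.default_rng(0), max_iters=200):
    """Greedy MRF-style optimization over per-position candidate costs.

    `costs[p]` is a list of candidate costs for position p, where index 0 is
    the tile from the reference photo. Positions are visited in random order;
    a candidate replaces the current choice only if it lowers the cost.
    """
    choice = {p: 0 for p in costs}              # start from the reference photo
    for _ in range(max_iters):
        p = rng.choice(list(costs))             # random tile position, as in [0047]
        best = int(np.argmin(costs[p]))
        if costs[p][best] < costs[p][choice[p]]:
            choice[p] = best                    # replace the tile; in the full
                                                # method, neighbor costs would be
                                                # recomputed here (see [0048])
        # stop when no position can still be improved (see [0049])
        if all(choice[q] == int(np.argmin(costs[q])) for q in costs):
            break
    return choice

# Three tile positions; position 0 has a cheaper candidate (index 1).
costs = {0: [5.0, 1.0, 3.0], 1: [2.0, 4.0], 2: [7.0, 7.0]}
print(greedy_optimize(costs))                   # -> {0: 1, 1: 0, 2: 0}
```

With fixed costs the loop converges in one sweep; with neighbor-dependent costs the same structure keeps iterating until a full pass produces no replacement, which is the stopping condition of paragraph [0049].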
[0050] Given the possibility of converging to a sub-optimal solution, restarting the algorithm multiple times with random initialization can be considered. The number of restarts should scale with the amount of computation that can be performed according to the specification of the final device in which the method will be implemented.
[0051] If during the optimization there is no progress and the loss is not diminishing, the global cost optimization stops, which can be called early stopping.
[0052] Then the method goes to a blending step 700. This processing is performed in order to output a final photo having at least one object removed and replaced by background. In one embodiment, once a region (tile) to be reconstructed is calculated in a specific iteration of the optimization step, it is blended into the reference photo, for example if on-the-fly displaying is required. In another embodiment, blending is performed after the end of the optimization step. Blending can be done by any suitable algorithm, for example using Poisson blending as described in Patrick Perez, Michel Gangnet and Andrew Blake, "Poisson Image Editing", 2014. The purpose of this step is to receive a photo of very good quality. The Poisson blending algorithm alleviates color differences and reduces the number of artifacts. The regions (tiles) are blended onto the reference image to remove the crowd and unwanted objects and produce the final result.
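By way of a non-limiting illustration, the blending of step 700 can be sketched as follows. Instead of full gradient-domain Poisson blending, the sketch feathers the border of each pasted tile with a linear alpha ramp, which likewise suppresses sharp transitions between tiles, although at lower quality; all names are illustrative only.

```python
import numpy as np

def paste_with_feather(canvas, tile, x0, y0, margin=4):
    """Paste `tile` into `canvas` at (x0, y0) with a linear alpha ramp.

    Simplified stand-in for step 700: near each tile border the result is a
    weighted mix of the tile and the underlying reference photo, so the seam
    between replaced and original regions is softened.
    """
    h, w = tile.shape[:2]
    alpha = np.ones((h, w), dtype=float)
    ramp = np.linspace(0.0, 1.0, margin)
    alpha[:, :margin] *= ramp                     # fade in from the left edge
    alpha[:, -margin:] *= ramp[::-1]              # fade out toward the right edge
    alpha[:margin, :] *= ramp[:, None]            # top edge
    alpha[-margin:, :] *= ramp[::-1][:, None]     # bottom edge
    region = canvas[y0:y0 + h, x0:x0 + w]
    canvas[y0:y0 + h, x0:x0 + w] = (alpha[..., None] * tile
                                    + (1 - alpha[..., None]) * region)
    return canvas

canvas = np.zeros((20, 20, 3))                    # reference photo stand-in
canvas = paste_with_feather(canvas, np.ones((8, 8, 3)), 5, 5)
print(canvas[9, 9, 0], canvas[5, 5, 0])           # -> 1.0 0.0
```

The tile fully replaces the reference content away from its border while the border itself blends gradually into the surroundings; Poisson blending achieves the same continuity in the gradient domain, additionally correcting color shifts between the photos.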
[0053] The processing system according to the invention (not shown) comprises a memory and a processor which is configured (by comprising several software modules) to implement the steps of the method described above. In one embodiment the processing system is embedded into a mobile phone. The mobile phone according to the invention comprises display means (not shown) for displaying the output of the software modules. From the user perspective, the processing system allows the user to perform the following actions: the user starts object or crowd removal and the first photo is taken; the user continues object or crowd removal and more photos are collected; the user stops object or crowd removal and the final result is computed. The processing system uses the collected data to remove objects, for example people, from the first photo.
[0054] In another embodiment the processing system can comprise a module for supporting steady holding of the camera. Said module outputs information on the position of the camera to be displayed by the display means. It increases the chances of acquiring consecutive photos taken from the same perspective. In one embodiment, the display means are configured to show the reference photo with frames surrounding detected objects and a camera position indicator. The user can see when enough information has been acquired to remove a specific object, for example based on the frame colour. In another embodiment the method stops automatically once the required amount of information has been gathered for removal of all detected objects.
[0055] Aspects of the present invention can be realized by a computer of a system or apparatus (or devices such as a CPU or MPU) that reads out and executes a computer program product recorded on a memory device to perform the functions of the above-described embodiment(s), and by a method, the steps of which are performed by a computer of a system or apparatus by, for example, reading out and executing a program recorded on a memory device to perform the functions of the above-described embodiment(s). For this purpose, the claimed computer program product is provided to the computer for example via a network or from a recording medium of various types serving as the memory device. The computer program product according to the invention comprises also a non-transitory machine-readable medium.
[0056] It should be understood that the present invention is not limited to the above examples. For those of ordinary skill in the art, improvements or changes can be made according to the above description, and all these improvements and changes should fall within the protection scope of the appended claims of the present invention.

Claims

1. A computer-implemented method for automatic object removal from a photo, comprising:
- providing a reference photo containing at least one object to be removed from the reference photo
- providing at least one consecutive photo containing at least one object to be removed from the reference photo
- performing alignment of each of the at least one consecutive photo with the reference photo;
- for the reference photo and each of the at least one consecutive photo, performing separately:
a) determination of at least one object to be removed from the reference photo,
b) division of the photo into tiles,
c) calculation of the cost function for each tile;
- next, based on all tiles from all photos, searching for the best new combination of tiles by performing optimization of the global cost function;
- outputting a photo comprising the best combination of tiles for which the global cost function has the minimum value, said outputted photo being the reference photo with at least one object removed and replaced by background from at least one consecutive photo.
2. A method according to claim 1, wherein the calculation of the cost function is performed according to the formula:
C(xp) = O(xp) + Σq Vpq(xp, xq)
where C(xp) is the cost of assigning a tile xp to a given position p,
O(xp) is a component defining whether the tile contains an object to be removed, and
Vpq(xp, xq) is the sum of squared differences (SSD) between adjacent pixels of the tile xp and a neighbouring tile xq.
3. A method according to claim 1, wherein steps a) to c) are performed on the fly after providing each consecutive photo.
4. A method according to claim 1, wherein the detection of at least one object to be removed from the reference photo is a semantic object detection using a neural network.
5. A method according to claim 1, wherein each photo is divided into at least 2x2 tiles.
6. A method according to claim 1, wherein in the optimization step a loopy belief propagation algorithm is used.
7. A method according to claim 1, wherein the step of outputting a new photo comprises blending.
8. A method according to claim 1, wherein the step of providing at least one consecutive photo ends automatically.
9. A method according to claim 1, wherein the step of alignment comprises key point determination using a combined corner and edge detector.
10. A method according to claim 1, wherein in each consecutive iteration of the global cost function optimization step a new value of the cost function is calculated for all candidate tiles being neighboring tiles of the tile replaced in the reference photo in the previous iteration.
11. A data processing system, comprising: a processor; and a memory coupled to the processor to store instructions which, when executed by the processor, cause the processor to perform operations, the operations comprising:
- providing a reference photo containing at least one object to be removed from the reference photo
- providing at least one consecutive photo containing at least one object to be removed from the reference photo
- performing alignment of each of the at least one consecutive photo with the reference photo;
- for the reference photo and each of the at least one consecutive photo, performing separately:
a) determination of at least one object to be removed from the reference photo,
b) division of the photo into tiles,
c) calculation of the cost function for each tile;
- next, based on all tiles from all photos, searching for the best new combination of tiles by performing optimization of the global cost function;
- outputting a photo comprising the best combination of tiles for which the global cost function has the minimum value, said outputted photo being the reference photo with at least one object removed and replaced by background from at least one consecutive photo.
12. The data processing system according to claim 11, wherein the calculation of the cost function is performed according to the formula:
C(xp) = O(xp) + Σq Vpq(xp, xq)
where C(xp) is the cost of assigning a tile xp to a given position p,
O(xp) is a component defining whether the tile contains an object to be removed, and
Vpq(xp, xq) is the sum of squared differences (SSD) between adjacent pixels of the tile xp and a neighbouring tile xq.
13. The data processing system according to claim 11, wherein in each consecutive iteration of the global cost function optimization step a new value of the cost function is calculated for all candidate tiles being neighboring tiles of the tile replaced in the reference photo in the previous iteration.
14. A mobile phone comprising a data processing system according to claim 11.
15. A computer program product comprising instructions which, when executed by a processor, cause the processor to perform operations, the operations comprising:
- providing a reference photo containing at least one object to be removed from the reference photo
- providing at least one consecutive photo containing at least one object to be removed from the reference photo
- performing alignment of each of the at least one consecutive photo with the reference photo;
- for the reference photo and each of the at least one consecutive photo, performing separately:
a) determination of at least one object to be removed from the reference photo,
b) division of the photo into tiles,
c) calculation of the cost function for each tile;
- next, based on all tiles from all photos, searching for the best new combination of tiles by performing optimization of the global cost function;
- outputting a photo comprising the best combination of tiles for which the global cost function has the minimum value, said outputted photo being the reference photo with at least one object removed and replaced by background from at least one consecutive photo.
PCT/PL2020/050029 2020-04-17 2020-04-17 Method for automatic object removal from a photo, processing system and associated computer program product WO2021210995A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/PL2020/050029 WO2021210995A1 (en) 2020-04-17 2020-04-17 Method for automatic object removal from a photo, processing system and associated computer program product


Publications (1)

Publication Number Publication Date
WO2021210995A1 true WO2021210995A1 (en) 2021-10-21

Family

ID=72179133


Country Status (1)

Country Link
WO (1) WO2021210995A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8340463B1 (en) * 2008-08-29 2012-12-25 Adobe Systems Incorporated Candidate pruning for patch transforms


Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
CHRIS HARRIS, MIKE STEPHENS: "A Combined Corner and Edge Detector", ALVEY VISION CONFERENCE, 1988
GRANADOS MIGUEL ET AL: "Background Inpainting for Videos with Dynamic Objects and a Free-Moving Camera", 9 July 2012, 12TH EUROPEAN CONFERENCE ON COMPUTER VISION, ECCV 2012; [LECTURE NOTES IN COMPUTER SCIENCE], PAGE(S) 682 - 695, ISBN: 978-3-319-23527-1, ISSN: 0302-9743, XP047530835 *
JOSEPH REDMON, ALI FARHADI: "YOLOv3: An Incremental Improvement", 2018
M. A. FISCHLER, R. C. BOLLES: "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography", COMM. OF THE ACM, 1981
PATRICK PEREZ, MICHEL GANGNET, ANDREW BLAKE: "Poisson Image Editing", 2014
SANG-HEON LEE ET AL: "Video inpainting algorithm using spatio-temporal consistency", PROCEEDINGS OF SPIE, vol. 7246, 2 February 2009 (2009-02-02), US, pages 72460N, XP055736513, ISSN: 0277-786X, ISBN: 978-1-5106-3798-6, DOI: 10.1117/12.805892 *
SHROTRE ADITEE ET AL: "Background recovery from multiple images", 2013 IEEE DIGITAL SIGNAL PROCESSING AND SIGNAL PROCESSING EDUCATION MEETING (DSP/SPE), IEEE, 11 August 2013 (2013-08-11), pages 135 - 140, XP032513407, DOI: 10.1109/DSP-SPE.2013.6642579 *
ZHONG-QIU ZHAO ET AL: "Object Detection With Deep Learning: A Review", IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, vol. 30, no. 11, 28 January 2019 (2019-01-28), Piscataway, NJ, USA, pages 3212 - 3232, XP055707136, ISSN: 2162-237X, DOI: 10.1109/TNNLS.2018.2876865 *


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20760576

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 21.02.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20760576

Country of ref document: EP

Kind code of ref document: A1