CN112862838A - Natural image matting method based on real-time click interaction of user - Google Patents
Natural image matting method based on real-time click interaction of user
- Publication number
- CN112862838A (application number CN202110158221.1A)
- Authority
- CN
- China
- Prior art keywords
- image
- mask
- user
- uncertainty
- image mask
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G06T11/40—Filling a planar surface by adding surface attributes, e.g. colour or texture
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/194—Segmentation; Edge detection involving foreground-background segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10024—Color image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20112—Image segmentation details
- G06T2207/20132—Image cropping
Abstract
The invention discloses a natural image matting method based on real-time click interaction with a user, comprising the following steps: acquiring an input original image and an indicator map, obtained through user interaction, that encodes foreground and background information; according to the indicator map, extracting from the complete image mask of the original image an image mask containing only the foreground indicated in the map, to serve as a preliminary image mask; performing uncertainty estimation on the preliminary image mask to obtain an uncertainty map; under the guidance of the uncertainty map, cropping pixel blocks whose uncertainty exceeds a preset value from corresponding positions in the preliminary image mask and the original image, and locally refining them with a fully convolutional network without downsampling; and, after the local refinement result is obtained, pasting it back to the corresponding positions of the preliminary image mask to obtain the refined image mask. With only a few user interactions, the method substantially outperforms existing fully automatic matting methods and is comparable to the state-of-the-art trimap-based matting methods.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a natural image matting method based on real-time click interaction of a user.
Background
Image matting is a fundamental and challenging problem in the field of computer vision. It requires accurately separating foreground objects from the background while precisely estimating the per-pixel transparency (alpha) near the separation edges. Because it has a wide range of application scenarios, such as image composition and editing, movie production, and virtual backgrounds in video conferencing, it has been studied by academia and industry for many years.
An image can be formally expressed as a mathematical formula from the perspective of image composition as follows:
I_i = α_i · F_i + (1 − α_i) · B_i,  α_i ∈ [0, 1]    (1)
In the above equation, i = (x, y) denotes a pixel position in the input image I, and α_i, F_i, B_i respectively denote the foreground transparency, foreground value, and background value at pixel i. This formula gives the pixel-level interpretation of image formation: each pixel in the image is a linear combination of a foreground value and a background value, and α_i represents the mixing proportion of foreground to background, i.e., the transparency. When α_i = 1, the pixel consists entirely of foreground pixel values, i.e., it is completely opaque; when α_i = 0, the pixel consists entirely of background pixel values, i.e., it is completely transparent; when α_i ∈ (0, 1), the pixel is a linear combination of foreground and background pixel values and lies in the boundary region between foreground and background, such as animal hair or plant branches and leaves.
The aim of image matting is to solve the optimization problem defined by equation (1) for α_i, yielding a single-channel image mask (alpha matte). For each pixel, 7 unknowns must be recovered from the known 3-channel pixel values of image I: the single-channel transparency, the 3-channel foreground value, and the 3-channel background value. This is evidently a highly under-constrained problem.
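For illustration, the compositing model of equation (1) can be written as a minimal NumPy sketch (the function name and array shapes are illustrative, not part of the claimed method):

```python
import numpy as np

def composite(alpha, fg, bg):
    """Per-pixel linear compositing: I = alpha * F + (1 - alpha) * B.

    alpha: (H, W) array in [0, 1]; fg, bg: (H, W, 3) color arrays.
    """
    a = alpha[..., None]          # broadcast alpha over the 3 color channels
    return a * fg + (1.0 - a) * bg

# alpha = 1 -> pure foreground (opaque), alpha = 0 -> pure background
# (transparent), alpha in (0, 1) -> soft boundary pixels (hair, foliage, ...)
```

Matting inverts this map: given only I, it must recover alpha (and implicitly F and B), which is why the problem is under-constrained.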
To address this, many classical approaches rely on a pre-drawn trimap as an additional input to constrain the solution space. The trimap divides an image into three regions: a foreground region, a background region, and a transition region. The foreground region indicates that all pixels within it consist entirely of foreground pixel values, and the background region indicates that all pixels within it consist entirely of background pixel values. The matting task is thereby simplified to regressing the transparency α_i only for pixels in the transition region of the trimap. As a result, matting methods that take a trimap as auxiliary input can generally achieve better performance. However, drawing a suitable trimap is very time-consuming and laborious; for some complex examples the drawing time can exceed ten minutes, which is extremely unfriendly to users, especially non-professional users.
With the rapid development of deep learning, some matting methods have recently emerged that do not require a trimap as an additional auxiliary input. However, their performance is far inferior to trimap-based matting approaches. The main reason is that, for some images, the lack of the trimap's guiding constraint leaves the deep network ambiguous about which foreground object to matte. To resolve this ambiguity, some approaches collect large-scale matting datasets for only certain object classes (e.g., faces) to train the deep network. However, this solution does not scale and is costly; in particular, if the user wants to matte categories that do not appear in the training set, the results are often poor. Moreover, if multiple objects (i.e., candidate foregrounds) appear in an image, the user may not want to extract all of them, so some user interaction is inevitable. The key is how to minimize the user's interaction cost while accurately extracting the foreground the user specifies, and no effective scheme currently exists.
Disclosure of Invention
The invention aims to provide a natural image matting method based on real-time click interaction with a user, which achieves performance comparable to high-performing trimap-based image matting methods while requiring only a small number of user clicks indicating whether a position is foreground or background (in most cases, when the image has no foreground ambiguity, no user clicks are needed at all).
The purpose of the invention is realized by the following technical scheme:
a natural image matting method based on real-time click interaction of a user comprises the following steps:
Interactive matting stage: acquiring an input original image and an indicator map, obtained through user interaction, that encodes foreground and background information; according to the indicator map, extracting from the complete image mask of the original image an image mask containing only the foreground indicated in the map, to serve as a preliminary image mask;
Uncertainty-guided local refinement stage: performing uncertainty estimation on the preliminary image mask to obtain an uncertainty map; under the guidance of the uncertainty map, cropping pixel blocks whose uncertainty exceeds a preset value from corresponding positions in the preliminary image mask and the original image, and locally refining them with a fully convolutional network without downsampling; and, after the local refinement result is obtained, pasting it back to the corresponding positions of the preliminary image mask to obtain the refined image mask, which serves as the complete image mask of this iteration.
It can be seen from the above technical scheme that, with only a few user interactions, the method substantially outperforms existing fully automatic matting methods and is comparable to the current state-of-the-art trimap-based matting methods.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
FIG. 1 is a framework diagram of the natural image matting method based on real-time user click interaction according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the real-time user click interaction process according to an embodiment of the present invention;
FIG. 3 is a visual test comparison on real portrait images provided by an embodiment of the present invention;
FIG. 4 is a visual comparison before and after local refinement according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a natural image matting method based on real-time click interaction with a user, which, as shown in FIG. 1, mainly comprises the following two stages.
First, the interactive matting stage.
In this stage, the input original image and an indicator map encoding foreground and background information, obtained through user interaction, are acquired; according to the indicator map, an image mask containing only the foreground indicated in the map is extracted from the complete image mask of the original image to serve as the preliminary image mask.
As shown in fig. 1, the operation of the interactive matting phase is accomplished by an encoder in cooperation with a mask decoder.
The encoder is used for encoding the input original image and the indication map.
And the mask decoder is used for predicting the preliminary image mask according to the coding result of the coder.
Illustratively, the encoder and the mask decoder may be implemented by a U-Net network.
In the embodiment of the invention, the original image can be a three-channel RGB image, and the indication image is a single-channel image.
In the embodiment of the invention, before any interaction all pixel values in the indicator map are 0. During interaction, if a click indicating foreground is received from the user, a dot with radius r and pixel value 1 is filled in at the corresponding position of the indicator map; if a click indicating background is received, a dot with radius r and pixel value −1 is filled in at the corresponding position.
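The click-to-indicator-map encoding described above can be sketched as follows (a minimal NumPy illustration; the function name and the disc rasterization are assumptions, since the patent only specifies a dot of radius r with value +1 or −1):

```python
import numpy as np

def add_click(indicator, y, x, is_foreground, r=15):
    """Stamp a disc of radius r into the single-channel indicator map.

    Foreground clicks write +1, background clicks write -1; the map
    starts as all zeros before any interaction.
    """
    h, w = indicator.shape
    ys, xs = np.ogrid[:h, :w]
    disc = (ys - y) ** 2 + (xs - x) ** 2 <= r ** 2
    indicator[disc] = 1.0 if is_foreground else -1.0
    return indicator
```

The resulting single-channel map is concatenated with the RGB image as network input.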
In the embodiment of the invention, a main difficulty in enabling the deep network to adapt its behavior to user clicks is collecting training data with real clicks, which is quite expensive. The invention therefore innovatively proposes to simulate user clicks during training: in the training phase, for each training image, a number of foreground or background points (e.g., 0 to 6) of a specified radius (e.g., 15 pixels) are randomly sampled to generate the indicator map.
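The training-time click simulation can be sketched as below. Sampling foreground clicks from fully opaque pixels and background clicks from fully transparent ones is an assumption; the patent only states that a number of points of fixed radius is randomly sampled:

```python
import numpy as np

def simulate_clicks(alpha_gt, max_clicks=6, r=15, rng=None):
    """Simulate user clicks from a ground-truth matte during training.

    Randomly samples up to max_clicks points: foreground clicks from
    fully opaque pixels (alpha == 1), background clicks from fully
    transparent ones (alpha == 0), stamped as +1 / -1 discs. The exact
    sampling distribution is an assumption made for this sketch.
    """
    rng = rng if rng is not None else np.random.default_rng()
    indicator = np.zeros_like(alpha_gt, dtype=np.float32)
    n = rng.integers(0, max_clicks + 1)   # 0..6 clicks per image
    h, w = indicator.shape
    for _ in range(n):
        want_fg = rng.random() < 0.5
        region = np.argwhere(alpha_gt == (1.0 if want_fg else 0.0))
        if len(region) == 0:
            continue                      # no such region in this image
        y, x = region[rng.integers(len(region))]
        ys, xs = np.ogrid[:h, :w]
        disc = (ys - y) ** 2 + (xs - x) ** 2 <= r ** 2
        indicator[disc] = 1.0 if want_fg else -1.0
    return indicator
```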
In the training stage, an error function composed of an image-space loss (L_reg) and a gradient-space loss (L_grad) is designed.
The image-space loss applies an L1 loss to the transition region T of the original image and an L2 loss to the foreground and background regions S = {F, B} of the original image:
L_reg = (1/|T|) Σ_{i∈T} |α_p^i − α_g^i| + (1/|S|) Σ_{j∈S} (α_p^j − α_g^j)²
In the above formula, α_p and α_g respectively denote the predicted preliminary image mask and the given supervision mask; with a pixel superscript they denote the mask value and the supervision value at the corresponding pixel; |·| denotes the number of elements in a set; i and j are pixel indices.
The gradient-space loss is the L1 loss between the spatial gradients of the predicted preliminary image mask and the supervision mask:
L_grad = (1/|Ω|) Σ_{i∈Ω} |∇α_p^i − ∇α_g^i|
where Ω denotes all pixels of the original image I and ∇ is the spatial gradient operator. Introducing the gradient-space loss effectively encourages the network to generate sharper matting results.
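A minimal PyTorch sketch of these two losses follows (the function names and the finite-difference gradient are assumptions; the patent does not specify a particular gradient operator):

```python
import torch

def image_space_loss(alpha_p, alpha_g, trans_mask):
    """L1 on the transition region T, L2 on foreground/background S = ~T."""
    t = trans_mask.bool()
    l1 = (alpha_p[t] - alpha_g[t]).abs().mean() if t.any() else alpha_p.sum() * 0.0
    s = ~t
    l2 = ((alpha_p[s] - alpha_g[s]) ** 2).mean() if s.any() else alpha_p.sum() * 0.0
    return l1 + l2

def gradient_loss(alpha_p, alpha_g):
    """L1 between spatial gradients (forward differences); encourages
    sharper matte edges. alpha_*: (B, 1, H, W) tensors."""
    def grad(a):
        gx = a[..., :, 1:] - a[..., :, :-1]
        gy = a[..., 1:, :] - a[..., :-1, :]
        return gx, gy
    gxp, gyp = grad(alpha_p)
    gxg, gyg = grad(alpha_g)
    return (gxp - gxg).abs().mean() + (gyp - gyg).abs().mean()
```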
Second, the uncertainty-guided local refinement stage.
This stage aims to automatically and locally refine the preliminary image mask output by the interactive matting stage so as to output a finer image mask. It mainly comprises: performing uncertainty estimation on the preliminary image mask to obtain an uncertainty map; under the guidance of the uncertainty map, cropping pixel blocks whose uncertainty exceeds a preset value from corresponding positions in the preliminary image mask and the original image, and locally refining them with a fully convolutional network without downsampling; and, after the local refinement result is obtained, pasting it back to the corresponding positions of the preliminary image mask to obtain the refined image mask, which serves as the complete image mask of this iteration. The uncertainty estimation and local refinement processes are described below.
1) Uncertainty estimation.
As shown in FIG. 1, the uncertainty map is estimated by the encoder working in conjunction with an uncertainty estimation module. The uncertainty estimation module is parallel to the mask decoder and shares the same encoder; it describes the preliminary image mask prediction with a univariate Laplace distribution, from which the uncertainty map is estimated:
f(x | μ, σ) = (1 / 2σ) · exp(−|x − μ| / σ)
where μ is the preliminary image mask α_p; σ is the uncertainty map σ_p output by the uncertainty estimation module, a larger value indicating greater uncertainty about the matting network's output; x is the supervision mask α_g. f(x | μ, σ) is the Laplace distribution characterizing the preliminary image mask with parameters μ and σ, and the ultimate goal is to estimate the uncertainty map.
As will be understood by those skilled in the art, the above equation models the preliminary mask prediction with a univariate Laplace distribution: the true value at each pixel i is the supervision value α_g^i, the preliminary image mask predicted by the mask decoder plays the role of the mean μ, and the output of the uncertainty estimation module corresponds to the scale σ of the predicted Laplace distribution, i.e., the uncertainty.
In the embodiment of the invention, the uncertainty estimation module is trained by minimizing the negative log-likelihood:
L_ue = (1/|Ω|) Σ_{i∈Ω} [ log(2σ_p^i) + |α_g^i − α_p^i| / σ_p^i ]
where Ω denotes all pixels of the original image I and i is a pixel index.
In the implementation of the invention, the uncertainty map quantifies how uncertain each pixel value of the preliminary image mask is: for a pixel at any position, a larger uncertainty means the preliminary mask value output at that position is less reliable and needs further refinement.
After the uncertainty estimation module is trained with the above loss function L_ue, the uncertainty map of the preliminary image mask can be accurately estimated, thereby guiding the subsequent local refinement process.
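The Laplace negative log-likelihood used to train the uncertainty estimation module can be sketched in PyTorch as follows (the clamping constant is an added numerical-safety assumption, not part of the patent text):

```python
import torch

def laplace_nll(alpha_p, alpha_g, sigma_p, eps=1e-6):
    """Negative log-likelihood of a per-pixel univariate Laplace:
    f(x | mu, sigma) = 1/(2 sigma) * exp(-|x - mu| / sigma),
    with mu = alpha_p (mask decoder output) and sigma = sigma_p
    (uncertainty head output). Minimizing this trains sigma_p to be
    large exactly where the preliminary matte is unreliable.
    """
    sigma = sigma_p.clamp(min=eps)        # numerical safety
    nll = torch.log(2.0 * sigma) + (alpha_g - alpha_p).abs() / sigma
    return nll.mean()
```

Note the trade-off this loss encodes: a confident-but-wrong prediction (small σ, large error) is penalized much more heavily than an uncertain one.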
2) Local refinement process.
Under the guidance of the uncertainty map, pixel blocks whose uncertainty exceeds a preset value (the specific threshold can be set freely) are cropped from corresponding positions in the preliminary image mask and the original image (the default block size can be set to 64 × 64), and then fed together into a fully convolutional network without downsampling for local refinement.
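The uncertainty-guided cropping step can be sketched as below. The greedy, non-overlapping selection of patch centers is an assumption made for this sketch; the patent only requires cropping blocks whose uncertainty exceeds a preset value:

```python
import numpy as np

def crop_uncertain_patches(uncertainty, image, alpha_p, thresh,
                           patch=64, max_patches=8):
    """Pick patch centers where uncertainty is highest (above thresh)
    and cut aligned patch x patch blocks from the image and the
    preliminary matte. Returns (slices, image_patch, alpha_patch)
    tuples; the slices allow pasting refined results back in place.
    """
    u = uncertainty.copy()
    h, w = u.shape
    half = patch // 2
    patches = []
    for _ in range(max_patches):
        idx = np.unravel_index(np.argmax(u), u.shape)
        if u[idx] <= thresh:
            break                         # remaining pixels are confident
        y = int(np.clip(idx[0], half, h - half))
        x = int(np.clip(idx[1], half, w - half))
        sl = (slice(y - half, y + half), slice(x - half, x + half))
        patches.append((sl, image[sl], alpha_p[sl]))
        u[sl] = -np.inf                   # suppress this region
    return patches
```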
In the embodiment of the present invention, the fully convolutional network without downsampling is the refinement network shown in FIG. 1. Since the cropped pixel blocks are generally much smaller than the original image, the computational overhead is much lower than global refinement approaches. Because the transparency of most pixels in a cropped block has already been predicted accurately, only a small fraction of "stubborn" pixels need heavy refinement. To make the refinement network pay more attention to these "stubborn" pixels, a hard sample mining objective function is employed for training:
L_refine = (1/|C|) Σ_{i∈C} |α_p^i − α_g^i| + λ · (1/|H|) Σ_{j∈H} |α_p^j − α_g^j|
where C denotes the entire pixel set, α_p is the preliminary image mask, α_g is the supervision information, H denotes the hard pixel set consisting of the top K% of pixels ranked by supervision error, λ is the weight applied to the hard pixel set H, and i, j are pixel indices.
For example, K may be 20 and λ may be 1. Of course, the numerical values of the parameters in the embodiments of the present invention are all examples rather than limitations; in practical applications, a user may set them according to needs or experience.
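A minimal PyTorch sketch of the hard sample mining objective, using K = 20 and λ = 1 as defaults (the function name is illustrative):

```python
import torch

def hard_mining_loss(alpha_p, alpha_g, top_k=0.2, lam=1.0):
    """L1 over all patch pixels C plus a lambda-weighted extra L1 over
    the hardest K% of pixels H (largest absolute error), so the
    refinement network focuses on the few 'stubborn' pixels."""
    err = (alpha_p - alpha_g).abs().flatten()
    base = err.mean()                     # term over the full set C
    k = max(1, int(top_k * err.numel()))  # size of the hard set H
    hard = torch.topk(err, k).values.mean()
    return base + lam * hard
```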
The whole framework shown in FIG. 1 is trained in stages: 1) train the encoder and the mask decoder with the loss function L_alpha = L_reg + L_grad; 2) fix the encoder and the mask decoder and train the uncertainty estimation module with the loss function L_ue described above; 3) train the refinement network alone with the loss function L_refine introduced above.
FIG. 2 summarizes the real-time user click interaction process; the matting framework in FIG. 2 is the one shown in FIG. 1. At the initial moment, for the input original image, the encoder and the mask decoder jointly predict a preliminary image mask, and local refinement then produces a refined image mask that serves as the complete image mask at the initial moment. If this does not meet the user's expectations, the foreground and background information specified by the user in the original image is determined through interaction and an indicator map is generated; after the indicator map is encoded by the encoder, it is passed to the mask decoder through skip connections, and the mask decoder extracts a preliminary image mask from the complete image mask; local refinement then yields a refined image mask that serves as the complete image mask of this iteration. Each iteration comprises the two stages above, and the complete image mask used in the interactive matting stage is always the one obtained in the previous iteration; in actual operation, the iteration can be repeated until the desired complete image mask is obtained. The input image in FIG. 2 contains two foreground objects; in the user interaction the left object is indicated as background and the right object as foreground, and finally an image mask containing only the right object is output.
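The iterative two-stage loop of FIG. 2 can be sketched as a toy driver (all callables and names here are hypothetical stand-ins for the trained networks and the user interface, assumed for illustration only):

```python
import numpy as np

def interactive_session(image, predict_mask, refine, get_clicks, max_rounds=5):
    """Toy driver for the two-stage loop:
      predict_mask(image, indicator, prev_mask) -> (alpha_p, sigma_p)
      refine(image, alpha_p, sigma_p)           -> refined alpha
      get_clicks(alpha)                         -> [(y, x, is_fg), ...] or []
    The previous iteration's complete mask is fed back each round.
    """
    indicator = np.zeros(image.shape[:2], dtype=np.float32)
    full_mask = None
    for _ in range(max_rounds):
        alpha_p, sigma_p = predict_mask(image, indicator, full_mask)
        full_mask = refine(image, alpha_p, sigma_p)   # this round's result
        clicks = get_clicks(full_mask)
        if not clicks:                    # user satisfied -> stop iterating
            break
        for y, x, is_fg in clicks:
            indicator[y, x] = 1.0 if is_fg else -1.0  # simplified point stamp
    return full_mask
```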
Compared with the prior art, the method has the following advantages:
1. The method provides a brand-new interaction mode that achieves performance comparable to trimap-based methods with only a few click interactions. Compared with fully automatic matting methods that take no additional input, it avoids the semantic ambiguity problem and greatly improves performance; with only a few clicks, the deep network can generalize to categories not seen in the training set and output high-quality image masks.
2. The proposed uncertainty-guided local refinement lets a user flexibly choose the number of blocks to refine according to their own computational budget. This local refinement is more flexible and efficient than existing global refinement methods and avoids the computational overhead of reprocessing the majority of regions that are already predicted correctly.
To illustrate the performance advantages of the method, it is compared quantitatively with other existing methods on the DIM test set, as shown in Table 1. In the table, LF-Matting is a fully automatic matting method, and all other baselines are trimap-based matting methods. The quantitative evaluation metrics are: Sum of Absolute Differences (SAD), Mean Squared Error (MSE), Gradient error (Grad), and Connectivity error (Conn); for all four metrics, smaller is better. As can be seen from Table 1, the method achieves performance comparable to trimap-based matting methods while greatly reducing the interaction cost.
TABLE 1 quantitative comparison results on DIM test set
FIG. 3 shows the visual test comparison of the invention on real portrait images; for this test, the method was trained only on a portrait dataset. As can be seen from the figure, although only the portrait category exists in the training set, with a few user clicks indicating foreground or background the method easily generalizes to categories not seen in the training set; and by giving a background click on an unwanted foreground object, the method retains only the foreground object desired by the user. The advantages of the invention are therefore evident.
FIG. 4 shows a visual comparison before and after local refinement. In each row, from left to right: the original image, the result before local refinement, the result after local refinement, and the supervision information. It can be clearly seen that local refinement significantly improves edge detail and eliminates blurring artifacts.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (6)
1. A natural image matting method based on user real-time click interaction is characterized by comprising the following steps:
an interactive matting stage: acquiring an input original image and an indicator map, obtained through user interaction, that encodes foreground and background information; according to the indicator map, extracting from the complete image mask of the original image an image mask containing only the foreground indicated in the map, to serve as a preliminary image mask;
an uncertainty-guided local refinement stage: performing uncertainty estimation on the preliminary image mask to obtain an uncertainty map; under the guidance of the uncertainty map, cropping pixel blocks whose uncertainty exceeds a preset value from corresponding positions in the preliminary image mask and the original image, and locally refining them with a fully convolutional network without downsampling; and, after the local refinement result is obtained, pasting it back to the corresponding positions of the preliminary image mask to obtain the refined image mask, which serves as the complete image mask of this iteration.
2. The natural image matting method based on real-time user click interaction according to claim 1, characterized in that the operations of the interactive matting stage are completed by an encoder in cooperation with a mask decoder;
the encoder is used for encoding the input original image and the indicator map;
the mask decoder is used for predicting the preliminary image mask according to the encoding result of the encoder;
at the initial moment, for an input original image, the encoder and the mask decoder jointly predict a preliminary image mask, and a refined image mask is then obtained through local refinement as the complete image mask at the initial moment; then, through interaction with the user, the foreground and background information specified by the user in the original image is determined and an indicator map is generated; after being encoded by the encoder, the indicator map is passed to the mask decoder through skip connections, and the mask decoder extracts a preliminary image mask from the complete image mask.
3. The natural image matting method based on real-time user click interaction according to claim 1 or 2, characterized in that the loss functions of the interactive matting stage include an image-space loss and a gradient-space loss;
the image-space loss applies an L1 loss to the transition region T of the original image and an L2 loss to the foreground and background regions S of the original image:
L_reg = (1/|T|) Σ_{i∈T} |α_p^i − α_g^i| + (1/|S|) Σ_{j∈S} (α_p^j − α_g^j)²
in the above formula, α_p and α_g respectively denote the predicted preliminary image mask and the given supervision mask; with a pixel superscript they denote the mask value and the supervision value at the corresponding pixel; |·| denotes the number of elements in a set; i and j are pixel indices;
the gradient-space loss is the L1 loss between the spatial gradients of the predicted preliminary image mask and the supervision mask:
L_grad = (1/|Ω|) Σ_{i∈Ω} |∇α_p^i − ∇α_g^i|
where Ω denotes all pixels of the original image I and ∇ is the spatial gradient operator.
4. The natural image matting method based on real-time click interaction of a user according to claim 1 or 2, characterized in that all pixel values in the indication map are 0 before any user interaction; during interaction, if a click adding a foreground indication is received from the user, a disc of radius r with pixel value 1 is filled in at the corresponding position of the indication map; if a click adding a background indication is received from the user, a disc of radius r with pixel value -1 is filled in at the corresponding position of the indication map;
in the training phase, for each training image, several foreground or background points with a specified radius are randomly sampled to generate the indication map.
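The indication-map construction of claim 4, including the training-time random sampling of clicks, can be sketched as follows; the helper names (`add_click`, `sample_training_hints`) and the default click count are illustrative assumptions:

```python
import numpy as np

def add_click(hint, y, x, r, value):
    """Stamp a disc of radius r centred at (y, x) with `value`:
    +1 for a foreground click, -1 for a background click."""
    yy, xx = np.ogrid[:hint.shape[0], :hint.shape[1]]
    hint[(yy - y) ** 2 + (xx - x) ** 2 <= r ** 2] = value
    return hint

def sample_training_hints(fg_mask, n_clicks=3, r=5, rng=None):
    """Training-time simulation of user clicks: sample random
    positions and stamp +1 / -1 discs depending on whether each
    position falls on the foreground mask."""
    if rng is None:
        rng = np.random.default_rng()
    # all zeros before any interaction, as required by claim 4
    hint = np.zeros(fg_mask.shape, dtype=np.float32)
    for _ in range(n_clicks):
        y = int(rng.integers(fg_mask.shape[0]))
        x = int(rng.integers(fg_mask.shape[1]))
        add_click(hint, y, x, r, 1.0 if fg_mask[y, x] else -1.0)
    return hint
```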
5. The natural image matting method based on real-time click interaction of a user according to claim 2, characterized in that the uncertainty map is estimated by the encoder working in cooperation with an uncertainty estimation module;
the uncertainty estimation network is parallel to a mask decoder and shares the same encoder, and primary image mask prediction is described by using univariate Laplace distribution, so that an uncertainty map is estimated:
where μ is the preliminary image mask α_p, σ is the uncertainty map σ_p output by the uncertainty estimation module, and x is the supervision information α_g;
the uncertainty estimation module is trained by minimizing the negative log-likelihood:

$$\mathcal{L}_{u} = \frac{1}{|\Omega|}\sum_{i \in \Omega}\left(\frac{\left|\alpha_g^i - \alpha_p^i\right|}{\sigma_p^i} + \log \sigma_p^i\right)$$
where Ω represents all pixels of the original image I and i represents the pixel index.
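The negative log-likelihood of claim 5 follows from the Laplace density: -log P = log(2σ) + |x-μ|/σ, with the constant log 2 dropped. A minimal sketch (the function name and the `eps` clamp are illustrative assumptions, added for numerical safety):

```python
import numpy as np

def laplace_nll(alpha_p, sigma_p, alpha_g, eps=1e-6):
    """Per-pixel Laplace negative log-likelihood, averaged over all
    pixels: |alpha_g - alpha_p| / sigma + log(sigma). Large sigma
    down-weights the data term, so sigma learns to flag uncertain
    pixels while the log term penalises blanket uncertainty."""
    s = np.maximum(sigma_p, eps)  # avoid division by zero
    return float((np.abs(alpha_g - alpha_p) / s + np.log(s)).mean())
```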
6. The method according to claim 1, characterized in that the local refinement is performed by a fully convolutional network without downsampling, which serves as the refinement network and is trained with a hard-sample-mining objective function:

$$\mathcal{L} = \frac{1}{|C|}\sum_{i \in C}\left|\alpha_p^i - \alpha_g^i\right| + \lambda\,\frac{1}{|H|}\sum_{j \in H}\left|\alpha_p^j - \alpha_g^j\right|$$
where C represents the set of all pixels, α_p is the preliminary image mask, α_g is the supervision information, H is the set of hard pixels whose error against the supervision information ranks in the top K% of all pixels, λ is the extra weight applied to the hard-pixel set H, and i, j both represent pixel indices.
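The hard-sample-mining objective of claim 6 can be sketched as below; the names `hard_mining_loss`, `top_k`, and `lam`, and their default values, are illustrative assumptions (the patent leaves K and λ unspecified):

```python
import numpy as np

def hard_mining_loss(alpha_p, alpha_g, top_k=0.1, lam=0.5):
    """L1 loss over all pixels C, plus a lambda-weighted extra L1
    term over the hard set H: the top-K% of pixels ranked by
    absolute error against the supervision information."""
    err = np.abs(alpha_p - alpha_g).ravel()
    base = err.mean()                       # term over C
    k = max(int(len(err) * top_k), 1)
    hard = np.sort(err)[-k:]                # K% largest errors = H
    return float(base + lam * hard.mean())  # extra term over H
```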
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110158221.1A CN112862838A (en) | 2021-02-04 | 2021-02-04 | Natural image matting method based on real-time click interaction of user |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112862838A true CN112862838A (en) | 2021-05-28 |
Family
ID=75988837
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110158221.1A Pending CN112862838A (en) | 2021-02-04 | 2021-02-04 | Natural image matting method based on real-time click interaction of user |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112862838A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111223106A (en) * | 2019-10-28 | 2020-06-02 | Gaoding (Xiamen) Technology Co., Ltd. | Full-automatic portrait mask matting method and system |
US20200311946A1 (en) * | 2019-03-26 | 2020-10-01 | Adobe Inc. | Interactive image matting using neural networks |
Non-Patent Citations (1)
Title |
---|
Tianyi Wei et al., "Improved Image Matting via Real-time User Clicks and Uncertainty Estimation", arXiv:2012.08323v1 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113608805A (en) * | 2021-07-08 | 2021-11-05 | Alibaba Singapore Holding Ltd. | Mask prediction method, image processing method, display method and equipment |
CN113608805B (en) * | 2021-07-08 | 2024-04-12 | Alibaba Innovation Company | Mask prediction method, image processing method, display method and device |
CN113838084A (en) * | 2021-09-26 | 2021-12-24 | Shanghai University | Matting method based on codec network and guide map |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Li et al. | PDR-Net: Perception-inspired single image dehazing network with refinement | |
CN109712145B (en) | Image matting method and system | |
CN107818554B (en) | Information processing apparatus and information processing method | |
CN112862838A (en) | Natural image matting method based on real-time click interaction of user | |
CN114187624B (en) | Image generation method, device, electronic equipment and storage medium | |
Zhang et al. | GAIN: Gradient augmented inpainting network for irregular holes | |
Shahrian et al. | Temporally coherent and spatially accurate video matting | |
CN116205962B (en) | Monocular depth estimation method and system based on complete context information | |
Zheng et al. | Truncated low-rank and total p variation constrained color image completion and its moreau approximation algorithm | |
Zhang et al. | Hierarchical attention aggregation with multi-resolution feature learning for GAN-based underwater image enhancement | |
Cui et al. | Progressive dual-branch network for low-light image enhancement | |
CN112686830B (en) | Super-resolution method of single depth map based on image decomposition | |
CN115587967B (en) | Fundus image optic disk detection method based on HA-UNet network | |
CN116524307A (en) | Self-supervision pre-training method based on diffusion model | |
CN115731447A (en) | Decompressed image target detection method and system based on attention mechanism distillation | |
CN113516604B (en) | Image restoration method | |
Su et al. | Physical model and image translation fused network for single-image dehazing | |
CN112102216B (en) | Self-adaptive weight total variation image fusion method | |
CN112634331A (en) | Optical flow prediction method and device | |
CN113962332A (en) | Salient target identification method based on self-optimization fusion feedback | |
Hörentrup et al. | Confidence-aware guided image filter | |
CN113315995A (en) | Method and device for improving video quality, readable storage medium and electronic equipment | |
EP3032497A2 (en) | Method and apparatus for color correction | |
Saxena et al. | An efficient single image haze removal algorithm for computer vision applications | |
Zhuang et al. | Dimensional transformation mixer for ultra-high-definition industrial camera dehazing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
RJ01 | Rejection of invention patent application after publication ||
Application publication date: 2021-05-28 |