CN113838084A - Matting method based on codec network and guide map - Google Patents

Matting method based on codec network and guide map

Info

Publication number
CN113838084A
CN113838084A (application CN202111126534.5A / CN202111126534A)
Authority
CN
China
Prior art keywords
foreground
network
map
guide
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111126534.5A
Other languages
Chinese (zh)
Inventor
程航
徐树公
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN202111126534.5A
Publication of CN113838084A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/10: Segmentation; Edge detection
    • G06T 7/194: Segmentation; Edge detection involving foreground-background segmentation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/60: Analysis of geometric attributes
    • G06T 7/62: Analysis of geometric attributes of area, perimeter, diameter or volume
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10004: Still image; Photographic image
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20016: Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Abstract

A matting method based on a codec network and a guide map: a guide map is drawn according to the original image and a first prediction is made through the codec network to obtain a foreground mask; the original guide map is modified according to the predicted foreground mask and predicted again through the codec network, and these steps are repeated until an accurate matting result is obtained. With a ternary diagram, a sketch, a click map or an all-gray input as the guide map, the invention achieves accurate matting through simple operations.

Description

Matting method based on codec network and guide map
Technical Field
The invention relates to a technology in the field of image processing, in particular to a matting method based on a codec network and a guide map.
Background
Matting (Image Matting) is the process of generating, from an input image, a foreground mask that separates the foreground object (the object to be extracted) from the background. The general matting problem is modeled as solving I_i = α_i·F_i + (1 − α_i)·B_i, where α ∈ [0, 1], I denotes the input image (three channels), α (a single channel) denotes the foreground mask, F and B (three channels) denote the foreground and background regions respectively, and i indexes each pixel position. Solving this equation requires additional constraints; typical constraints are ternary diagrams (trimaps) or sketches (scribbles). Current matting methods are mainly based on deep learning, and their accuracy is significantly better than the traditional sampling-based and propagation-based methods. Most matting methods use a hand-drawn ternary diagram as the guide map to provide guidance information: a ternary diagram uses white (value 1), gray (value 0.5) and black (value 0) regions to represent the foreground, transition and background regions respectively, but it is time-consuming to draw and unfriendly as user input. The same applies to the sketch and the click map: a sketch uses black and white curves and gives less information than a ternary diagram, while a click map uses circles and provides even less information, placing higher demands on the neural network but being more convenient for manual input. In the past few years, most matting methods have used ternary diagrams and achieved higher precision.
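As a concrete illustration of the compositing equation above, the following NumPy sketch composites a foreground over a background with a single-channel alpha mask; the function and array names are illustrative and not part of the patent.

```python
import numpy as np

def composite(foreground, background, alpha):
    """Composite a foreground over a background with a single-channel alpha mask.

    foreground, background: H x W x 3 float arrays in [0, 1]
    alpha:                  H x W   float array  in [0, 1]
    Implements I = alpha * F + (1 - alpha) * B per pixel.
    """
    alpha = alpha[..., None]                               # broadcast to H x W x 1
    return alpha * foreground + (1.0 - alpha) * background

# Toy usage: a soft circular mask blending two constant-colour images.
h, w = 64, 64
fg = np.ones((h, w, 3)) * np.array([1.0, 0.2, 0.2])        # reddish foreground
bg = np.ones((h, w, 3)) * np.array([0.2, 0.2, 1.0])        # bluish background
yy, xx = np.mgrid[:h, :w]
alpha = np.clip(1.0 - np.hypot(yy - h / 2, xx - w / 2) / 20.0, 0.0, 1.0)
image = composite(fg, bg, alpha)                           # H x W x 3 composite
```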
Disclosure of Invention
Aiming at the defect that existing ternary-diagram-based matting technology cannot perform matting with a sketch or a click map, the invention provides a matting method based on a codec network and a guide map, which achieves accurate matting through simple operations with a ternary diagram, a sketch, a click map or an all-gray input as the guide map.
The invention is realized by the following technical scheme:
the invention relates to a matting method based on a coder-decoder network and a guide image, which is characterized in that a guide image is drawn according to an original image, a first prediction is carried out through the coder-decoder network to obtain a foreground mask, the original guide image is modified according to the predicted foreground mask and is predicted again through the coder-decoder network, the process is repeated until an accurate foreground mask is obtained, and then a foreground to be scratched is obtained through the accurate foreground mask and an input image.
The codec network comprises an encoder, a semantic information fusion module, skip connection modules and a decoder, wherein: the encoder extracts a multi-scale deep feature map from an input feature map formed by concatenating the input image and the guide map along the channel dimension and outputs it to the semantic information fusion module, and also extracts a multi-scale shallow feature map and outputs it to the decoder through the skip connection modules; the semantic information fusion module performs feature fusion and upsampling on the multi-scale deep feature map to obtain deep semantic features containing foreground contour information; and the decoder upsamples the deep semantic features while fusing them with the multi-scale shallow feature map to finally obtain the foreground mask.
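A minimal skeleton of this encoder / fusion / skip / decoder layout, assuming each component is supplied as a PyTorch module; the class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class MattingNet(nn.Module):
    """Skeleton of the encoder / semantic-fusion / skip / decoder layout described above."""

    def __init__(self, encoder, fusion, skips, decoder):
        super().__init__()
        self.encoder = encoder          # returns (shallow feature pyramid, deep feature pyramid)
        self.fusion = fusion            # FPEM + JPU style semantic information fusion
        self.skips = nn.ModuleList(skips)
        self.decoder = decoder

    def forward(self, image, guide):
        x = torch.cat([image, guide], dim=1)               # 4-channel input feature map
        shallow, deep = self.encoder(x)                    # multi-scale shallow / deep feature maps
        semantic = self.fusion(deep)                       # deep semantic features with foreground contours
        skip_feats = [m(f) for m, f in zip(self.skips, shallow)]
        alpha = self.decoder(semantic, skip_feats)         # upsample and fuse with shallow features
        return alpha
```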
The semantic information fusion module comprises a cascaded feature pyramid enhancement unit (FPEM) and joint upsampling unit (JPU), wherein: the feature pyramid enhancement unit extracts multi-scale features from the backbone network, fuses them and enhances semantic information, and the joint upsampling unit upsamples the enhanced features to obtain the deep semantic features.
The codec network is trained with a deep-learning method on a public matting dataset using progressive ternary-diagram deformation, specifically: during training, as the number of training steps increases, the ratio of the foreground area in the guide map fed to the network to the foreground area in the input image gradually decreases, and likewise the ratio of the background area in the guide map to the background area in the input image gradually decreases. The amount of deterministic information fed to the neural network is thus gradually reduced, forcing the codec network to learn to predict the foreground mask from the limited foreground and background information given in the guide map.
The training set and the test set use public matting datasets: the training set contains a number of foreground images with corresponding foreground masks, together with a number of background images, and the test set contains test images with corresponding foreground masks; the loss function uses L1 and L2 losses.
Technical effects
The invention as a whole overcomes the defects that the prior art can hardly reduce the difficulty of user input while keeping matting precision, and can hardly further optimize the matting result. The areas of the foreground and the background in the guide map are gradually reduced during training, so the guide map gradually changes from a ternary diagram to a sketch and the amount of deterministic information fed to the neural network is gradually reduced; this improves the robustness of the neural network and makes it learn to predict the foreground mask from the given foreground and background information in the guide map, instead of being confined to the single domain of ternary diagrams or sketches. The user can modify the original guide map according to the previously predicted foreground mask, add strokes prompting the foreground and the background in incorrectly predicted areas so as to add local guide-map information, then predict again, and repeat until a satisfactory matting result is obtained.
Drawings
FIG. 1 is a schematic diagram of a network architecture according to the present invention;
FIGS. 2a and 2b are a flow chart and an effect diagram, respectively, of progressive ternary diagram morphing-based training;
FIG. 3 is a schematic flow diagram of a semantic information fusion module;
FIGS. 4a and 4b are a flow chart and an effect diagram, respectively, of an iterative optimized matting method;
FIG. 5 is a schematic diagram of a backbone network and a skip connection module;
FIG. 6 is a schematic diagram of a partial structure of a decoder;
FIG. 7 is a schematic diagram of a progressive ternary diagram deformation process and a relationship between a curve thickness and a training step number;
Detailed Description
As shown in fig. 4a, in the matting method based on the codec network and the guide map according to this embodiment, a guide map is drawn according to the original image and a first prediction is performed through the codec network to obtain a foreground mask; the original guide map is then modified according to the predicted foreground mask and predicted again through the codec network, and the process is repeated until an accurate matting result is obtained.
As shown in fig. 1, in this embodiment the foreground mask is obtained by feeding the codec network the feature map formed by concatenating the RGB image and the guide map along the channel dimension.
As shown in fig. 3, 5 and 6, the codec network includes an encoder part, a semantic information fusion module, three skip connection modules, and a decoder part containing spatial attention modules, wherein: the three skip connection modules are arranged between the encoder part and the decoder part and output multi-scale shallow feature maps; the semantic information fusion module receives the multi-scale deep feature maps output by the encoder part and outputs deep semantic features to the decoder part; and the decoder part outputs the accurate matting result.
As shown in FIG. 1, the input feature map S1 of the encoder part in the training stage is formed by concatenating the RGB image and the guide map along the channel dimension: S1 ∈ R^(4×512×512), with height and width both 512. After the input feature map passes through two 3×3 convolution layers in a convolution module with the corresponding batch normalization, spectral normalization and ReLU activation functions, a 2× downsampled feature map S2 ∈ R^(32×256×256) is obtained; a 4× downsampled feature map S4 ∈ R^(64×128×128) is then obtained through a convolution and the first residual module in sequence; and the second, third and fourth residual modules respectively produce the 8× downsampled feature map S8 ∈ R^(128×64×64), the 16× downsampled feature map S16 ∈ R^(256×32×32) and the 32× downsampled feature map S32 ∈ R^(512×16×16).
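A hedged PyTorch sketch of this downsampling pyramid is given below; `res_block` stands for the residual module described in the next paragraph (and sketched after it), and the exact layer arrangement inside each stage is an assumption.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def conv_bn_relu(in_ch, out_ch, stride=1):
    """3x3 convolution with spectral norm, batch norm and ReLU, as used in the encoder."""
    return nn.Sequential(
        spectral_norm(nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Encoder(nn.Module):
    """Produces the S2/S4/S8/S16/S32 pyramid with the channel counts listed above."""

    def __init__(self, res_block):
        super().__init__()
        # S2: 32 x 256 x 256 after two 3x3 convolutions (first one strided)
        self.stem = nn.Sequential(conv_bn_relu(4, 32, stride=2), conv_bn_relu(32, 32))
        # S4: 64 x 128 x 128 via a strided convolution followed by the first residual module
        self.down4 = nn.Sequential(conv_bn_relu(32, 64, stride=2), res_block(64, 64, stride=1))
        self.down8 = res_block(64, 128, stride=2)    # S8:  128 x 64 x 64
        self.down16 = res_block(128, 256, stride=2)  # S16: 256 x 32 x 32
        self.down32 = res_block(256, 512, stride=2)  # S32: 512 x 16 x 16

    def forward(self, x):                            # x: B x 4 x 512 x 512 (RGB + guide map)
        s2 = self.stem(x)
        s4 = self.down4(s2)
        s8 = self.down8(s4)
        s16 = self.down16(s8)
        s32 = self.down32(s16)
        return (x, s2, s4), (s4, s8, s16, s32)       # shallow pyramid, deep pyramid
```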
The first to fourth residual modules each include a main branch and a downsampling branch, wherein: the main branch comprises two 3×3 convolution layers with the corresponding spectral normalization, batch normalization and ReLU activation functions, and the downsampling branch comprises an average pooling layer and a 1×1 convolution layer with the corresponding spectral normalization and batch normalization. The feature maps processed by the main branch and the downsampling branch are added element-wise, and the sum is passed through a ReLU activation function to form the output of the residual module.
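A possible PyTorch rendering of this residual module, assuming the downsampling stride sits on the first convolution of the main branch (the exact stride placement is not spelled out above):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

class DownResBlock(nn.Module):
    """Residual module: convolutional main branch plus an average-pool / 1x1-conv shortcut."""

    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.main = nn.Sequential(
            spectral_norm(nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            spectral_norm(nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Sequential(
            nn.AvgPool2d(stride) if stride > 1 else nn.Identity(),  # spatial downsampling
            spectral_norm(nn.Conv2d(in_ch, out_ch, 1, bias=False)),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # element-wise addition of both branches, then a final ReLU
        return self.relu(self.main(x) + self.shortcut(x))
```

This block can be plugged into the encoder sketch above, for example as `Encoder(DownResBlock)`.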
The skip connection modules transform the input feature map S1, the 2× downsampled feature map S2 and the 4× downsampled feature map S4 into fea1 ∈ R^(32×512×512), fea2 ∈ R^(32×256×256) and fea3 ∈ R^(64×128×128) respectively, and output these multi-scale shallow feature maps to the decoder part, wherein each skip connection module comprises two convolution layers with the corresponding spectral normalization, batch normalization and ReLU activation functions.
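A minimal sketch of the skip connection modules with the channel counts implied above; a 3×3 kernel size is assumed, since the text only specifies "two convolutional layers".

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def skip_block(in_ch, out_ch):
    """Skip connection module: two convolutions, each with spectral norm, batch norm and ReLU."""
    layers = []
    for i, o in [(in_ch, out_ch), (out_ch, out_ch)]:
        layers += [
            spectral_norm(nn.Conv2d(i, o, 3, padding=1, bias=False)),
            nn.BatchNorm2d(o),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)

# fea1: S1 (4 ch)  -> 32 x 512 x 512
# fea2: S2 (32 ch) -> 32 x 256 x 256
# fea3: S4 (64 ch) -> 64 x 128 x 128
skips = [skip_block(4, 32), skip_block(32, 32), skip_block(64, 64)]
```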
As shown in fig. 1 and fig. 3, the semantic information fusion module includes a feature pyramid enhancement unit (FPEM) and a joint upsampling unit (JPU), wherein: the feature pyramid enhancement unit extracts multi-scale features from the backbone network, fuses them and enhances semantic information, and the joint upsampling unit upsamples the enhanced features to obtain the deep semantic features.
The fusion is specifically: multi-scale feature fusion is performed on the 4×, 8×, 16× and 32× downsampled feature maps S4, S8, S16 and S32, and feature maps of sizes 128×128×128, 128×64×64, 128×32×32 and 128×16×16 are output.
The upsampling is specifically: the feature maps of sizes 128×64×64, 128×32×32 and 128×16×16 are fused and upsampled to obtain a feature map of size 512×64×64, which is bilinearly interpolated to a feature map of size 512×128×128; this is concatenated along the channel dimension with the 128×128×128 feature map output by the FPEM and processed by one convolution layer to obtain the deep semantic features SFM_OUT ∈ R^(64×128×128).
As shown in fig. 6, the decoder part includes a fifth residual module and two spatial attention modules, wherein: the fifth residual module adds, element-wise, the deep semantic features SFM_OUT ∈ R^(64×128×128) from the semantic information fusion module and fea3 ∈ R^(64×128×128) output by the skip connection module, then upsamples the result using nearest-neighbor sampling and deconvolution and outputs a feature map of size 32×256×256. The first spatial attention module takes the feature map output by the fifth residual module and the multi-scale shallow feature map fea2 output by the second skip connection module: the mean and the maximum of the 32×256×256 feature map are computed along the channel dimension, the two resulting maps are concatenated along the channel dimension and passed through one convolution layer and a sigmoid function to obtain an attention map, and the attention map is multiplied element-wise with the multi-scale shallow feature map fea2 to filter the shallow information; the resulting feature map is then passed through a deconvolution with the corresponding batch normalization and LeakyReLU to obtain a feature map of size 32×512×512, which is output to the second spatial attention module. The second spatial attention module fuses the feature map output by the first spatial attention module with the multi-scale shallow feature map fea1; the number of channels is then compressed to 1 with one convolution layer, and the value of each pixel is compressed to between 0 and 1 with the function α = (tanh(x) + 1)/2, giving the final foreground mask.
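The channel-wise mean/max attention used here to filter the shallow features can be sketched as follows; the 7×7 kernel of the attention convolution is an assumption, since the text only specifies "a layer of convolution".

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel-wise mean/max -> concat -> conv -> sigmoid attention for filtering shallow features."""

    def __init__(self, kernel_size=7):            # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, decoder_feat, shallow_feat):
        avg = decoder_feat.mean(dim=1, keepdim=True)                   # mean over the channel dimension
        mx, _ = decoder_feat.max(dim=1, keepdim=True)                  # max over the channel dimension
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))   # single-channel attention map
        return shallow_feat * attn                                     # element-wise filtering of skip features

# Final activation of the decoder: squash an unbounded map x into a [0, 1] foreground mask.
def to_alpha(x):
    return (torch.tanh(x) + 1.0) / 2.0
```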
The fifth residual module comprises a main branch and an upsampling branch, wherein: the main branch comprises two deconvolution layers with the corresponding spectral normalization, batch normalization and ReLU activation functions, and the upsampling branch comprises a nearest-neighbor upsampling layer and a 1×1 convolution layer with the corresponding spectral normalization and batch normalization. The feature maps processed by the main branch and the upsampling branch are added element-wise, and the sum is passed through a ReLU activation function to form the output of the module.
As shown in fig. 2a and 2b, the codec network is trained with progressive ternary-diagram deformation: the areas of the foreground and the background in the guide map are gradually reduced during training, the guide map gradually changes from a ternary diagram to a sketch, and the amount of deterministic information fed to the neural network is gradually reduced, so that the codec network learns to predict the foreground mask from the foreground and background information given in the guide map. The procedure specifically includes:
firstly, for each training step, a label foreground mask is taken from the DIM dataset, and erosion and dilation are applied to the foreground mask to obtain a ternary diagram;
the corrosion expansion refers to that: firstly, randomly selecting the size of a filter kernel in the range of 1-30, then respectively using the size of an erosion filter kernel and an expansion filter kernel to carry out erosion and expansion operations on a foreground mask of a label to obtain an erosion graph and an expansion graph, and drawing a white foreground area (with the value of 1) in the erosion graph and a black background area (with the value of 0) in the expansion graph on a full-gray (with the value of 0.5) graph to obtain a ternary graph.
Secondly, key points are obtained by sampling the foreground and background regions of the ternary diagram respectively, and curve functions are obtained by fitting curves to the foreground key points and to the background key points respectively;
the sampling refers to: randomly sampling 10 key points in the foreground and background areas of the ternary map according to uniform distribution to obtain the coordinates of the 10 key points.
The curve fitting is as follows: a curve is fitted with a cubic function through every three foreground key points or every three background key points; when only two points remain, the straight line connecting them is used, and when only one point remains, a circle is used whose diameter equals the thickness of the curves and lines.
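A sketch of the key-point sampling and curve drawing described in the two paragraphs above; for numerical stability the polynomial degree here is the highest exactly determined one (quadratic for three points) rather than the cubic named in the text, and the helper names are illustrative.

```python
import numpy as np
import cv2

def sample_keypoints(region_mask, n=10):
    """Uniformly sample n key-point coordinates (x, y) from a binary region mask."""
    ys, xs = np.nonzero(region_mask)
    idx = np.random.choice(len(ys), size=min(n, len(ys)), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)

def draw_scribble(shape, points, thickness):
    """Draw curves through groups of three key points onto an all-black canvas.

    Three points -> fitted curve, two points -> straight line, one point -> disc whose
    diameter equals the curve thickness. Degenerate x-coordinates are not handled.
    """
    canvas = np.zeros(shape, dtype=np.float32)
    for i in range(0, len(points), 3):
        grp = points[i:i + 3]
        if len(grp) >= 2:
            order = np.argsort(grp[:, 0])
            xs, ys = grp[order, 0], grp[order, 1]
            coeffs = np.polyfit(xs, ys, deg=len(grp) - 1)        # exactly determined polynomial
            xx = np.linspace(xs.min(), xs.max(), 100)
            pts = np.stack([xx, np.polyval(coeffs, xx)], axis=1)
            pts = pts.round().astype(np.int32).reshape(-1, 1, 2)
            cv2.polylines(canvas, [pts], False, 1.0, thickness)
        else:
            center = (int(grp[0][0]), int(grp[0][1]))
            cv2.circle(canvas, center, max(1, thickness // 2), 1.0, -1)
    return canvas
```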
Thirdly, the curve thickness is controlled according to the number of training steps to obtain a foreground sketch and a background sketch;
the thickness of the control curve refers to that: setting the thickness of a curve as d, setting the value of d at the initial training stage as a ternary diagram, setting the value of d at the later training stage as a ternary diagram, setting the value of d as a sketch, drawing a curve, a straight line and a circle obtained from a foreground key point and a curve, a straight line and a circle obtained from a background key point on two diagrams with all values as 0 (black) according to the thickness of d, and obtaining the foreground sketch and the background sketch.
As shown in fig. 7, the initial curve thickness is 800; it decreases as the number of training steps increases, reaches 40 at step 530,000, and is then kept at 40 until the 600,000 training steps are completed. Before step 530,000 the thickness follows, as a function of the training step, the schedule plotted in fig. 7.
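The thickness schedule can be written as a small helper; since the exact decay law before step 530,000 is only given graphically in fig. 7, a linear interpolation between 800 and 40 is assumed here purely for illustration.

```python
def curve_thickness(step, d_start=800, d_end=40, decay_steps=530_000):
    """Curve thickness schedule: decays from d_start to d_end over decay_steps, then stays flat.

    The exact decay law before step 530,000 is only shown graphically in fig. 7;
    a linear interpolation is assumed here purely for illustration.
    """
    if step >= decay_steps:
        return d_end
    frac = step / decay_steps
    return int(round(d_start + (d_end - d_start) * frac))
```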
Fourthly, removing redundant parts in the foreground sketch and the background sketch by using the foreground and background masks obtained from the ternary diagram;
the foreground and background masks obtained from the ternary diagram refer to: the ternary diagram is divided into a foreground region (with the value of 1), a transition region (with the value of 0.5) and a background region (with the value of 0), and corresponding masks are generated according to the foreground region and the background region respectively, namely the foreground mask and the background mask.
And fifthly, the foreground sketch and the background sketch with redundant parts removed are fused to obtain the guide map; as the number of training steps increases, the curve thickness is gradually reduced, so the foreground and background areas gradually shrink and the guide map gradually changes from a ternary diagram to a sketch.
The fusion of the foreground sketch and the background sketch refers to: given the foreground sketch P_FG and the background sketch P_BG with redundant parts removed, the guide map used for training is obtained with the formula 0.5 + 0.5·P_FG + (−0.5)·P_BG.
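A direct sketch of this fusion step, assuming the ternary-diagram encoding (1 / 0.5 / 0) described earlier:

```python
import numpy as np

def fuse_guide_map(fg_sketch, bg_sketch, trimap):
    """Build the training guide map: 0.5 + 0.5 * P_FG - 0.5 * P_BG.

    fg_sketch, bg_sketch: H x W binary stroke maps (1 on a stroke, 0 elsewhere)
    trimap:               H x W map with foreground = 1, transition = 0.5, background = 0
    """
    fg_mask = (trimap == 1.0).astype(np.float32)     # certain-foreground mask from the ternary diagram
    bg_mask = (trimap == 0.0).astype(np.float32)     # certain-background mask from the ternary diagram
    p_fg = fg_sketch * fg_mask                       # remove strokes that stray out of the foreground region
    p_bg = bg_sketch * bg_mask                       # remove strokes that stray out of the background region
    return 0.5 + 0.5 * p_fg - 0.5 * p_bg             # white strokes mark foreground, black strokes background
```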
The loss function of the whole neural network combines an alpha prediction loss and a gradient loss. The alpha prediction loss compares the predicted foreground mask α̂ with the label foreground mask α, using an L2 penalty over the region K (the foreground and background regions) and an L1 penalty over the region T (the transition region outside the foreground and background). The gradient loss L_grad is used at the same time and is defined as the L1 loss between ∇α̂ and ∇α.
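A hedged PyTorch sketch of this combined loss; the per-region normalisation and the equal weighting of the three terms are assumptions, since the exact formula appears only as an image in the original document.

```python
import torch

def matting_loss(alpha_pred, alpha_gt, trimap, eps=1e-6):
    """L2 penalty on the certain regions K, L1 on the transition region T, plus a gradient L1 loss."""
    known = ((trimap == 1.0) | (trimap == 0.0)).float()        # region K: certain foreground + background
    trans = 1.0 - known                                         # region T: transition region
    diff = alpha_pred - alpha_gt
    loss_known = (diff.pow(2) * known).sum() / (known.sum() + eps)
    loss_trans = (diff.abs() * trans).sum() / (trans.sum() + eps)

    # Gradient loss: L1 distance between finite-difference gradients of prediction and label.
    def grads(x):
        return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]

    gx_p, gy_p = grads(alpha_pred)
    gx_g, gy_g = grads(alpha_gt)
    loss_grad = (gx_p - gx_g).abs().mean() + (gy_p - gy_g).abs().mean()

    return loss_known + loss_trans + loss_grad
```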
All convolutional layers in the codec network of the present embodiment are processed using spectral normalization (SpectralNorm).
This embodiment was implemented with the PyTorch deep learning framework on an experimental test platform (CPU: AMD R5 3600, GPU: GTX 2080 Ti).
The dataset of this embodiment is the DIM dataset, which contains 43,100 backgrounds and 431 foreground objects; complete pictures must be synthesized from them by a program before being fed to the network for training. The input image size is adjusted to 512×512 during training, the number of training steps is 600,000 iterations, and the batch size is 10. The data augmentation used during training is random affine transformation together with the progressive ternary-diagram deformation strategy provided by the invention. First, a random affine transformation is applied to the foreground and the corresponding label foreground mask, and the foreground and background are then composited into one picture using the formula I_i = α_i·F_i + (1 − α_i)·B_i, where α is the foreground mask. This is followed by the progressive ternary-diagram deformation process shown in fig. 7: the ternary diagram is generated by applying erosion and dilation to the foreground mask, using a filter kernel of random size in the range 1-30. After the ternary diagram is obtained, key points are randomly sampled in the foreground and background regions, 10 in total; a curve is then fitted with a cubic function through every three foreground or background key points while the curve thickness is controlled, giving a foreground sketch and a background sketch; the foreground sketch and the background sketch are then multiplied element-wise by the foreground and background masks of the ternary diagram respectively, giving the foreground sketch P_FG and background sketch P_BG with redundant parts removed; finally the guide map used for training is synthesized from them with the formula 0.5 + 0.5·P_FG + (−0.5)·P_BG. During training, the curve thickness d decreases as the number of training steps increases, as shown in fig. 7.
The iterative optimization matting method provided by this embodiment works as follows. In the use phase, the foreground mask is predicted by inputting the guide map and the original image into the network. Since it is difficult to achieve the desired effect in a single pass, the guide map needs to be modified continually. The method supports retaining the guide map last fed to the network, so the user can modify it further on that basis. As shown in fig. 4b, the first matting result is not ideal; after some sketches are added in the second round, the outline of the foreground already appears, but the detail regions remain poor, so detail-prompting sketches can be added further to achieve a better result. Through this iterative optimization matting method, the user can keep modifying the guide map until an ideal effect is achieved.
TABLE 1 Test accuracy on the DIM test set (Composition-1k)
[Table 1 appears as an image in the original publication.]
Tests were performed on the DIM test set with four evaluation indices: Sum of Absolute Differences (SAD), Mean Squared Error (MSE), Gradient error (Grad) and Connectivity error (Conn). "Trimap test" means the test uses a ternary diagram, "Scribblemap test" means the test uses a sketch, "Clickmap test" means the test uses a click map, and "No-Guidance test" means a guide map with a constant value of 0.5 is input without any foreground or background prompts. The methods in the table are divided into three parts by dashed lines: ternary-diagram-based methods at the top, matting methods without guidance information in the middle, and the present method at the bottom.
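For reference, minimal NumPy sketches of the SAD and MSE indices; the division of SAD by 1000 and the restriction of MSE to the unknown region follow common matting-benchmark conventions and are assumptions here, not details stated in the patent.

```python
import numpy as np

def sad(alpha_pred, alpha_gt):
    """Sum of Absolute Differences, conventionally reported divided by 1000."""
    return np.abs(alpha_pred - alpha_gt).sum() / 1000.0

def mse(alpha_pred, alpha_gt, trimap=None):
    """Mean Squared Error, optionally restricted to the unknown (transition) region of the trimap."""
    diff2 = (alpha_pred - alpha_gt) ** 2
    if trimap is not None:
        return diff2[trimap == 0.5].mean()
    return diff2.mean()
```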
Comparing the results of this embodiment with the ternary-diagram-based methods shows that, when a ternary diagram is input, the accuracy of the method exceeds all ternary-diagram-based methods; when a sketch or a click map is input, the input of this embodiment is simpler for the user while the accuracy still exceeds all ternary-diagram-based methods. With no guidance information input, the result of this embodiment is far superior to previous guidance-free matting methods. Comparing the four test settings of this embodiment shows that the more accurate the given guidance information, the better the matting result.
Compared with the prior art, the invention achieves an SAD of 30.1 on the DIM test set when trained with the DIM training set; compared with ternary-diagram-based methods, matting can be done with only a sketch or clicks, so users can matte simple foregrounds with a sketch or clicks to save time, and can draw a more detailed ternary diagram for difficult foregrounds; furthermore, the guide map input last time can be modified further, continually optimizing the matting result.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (3)

1. A matting method based on a codec network and a guide map, characterized in that a guide map is drawn according to the original image and a first prediction is made through the codec network to obtain a foreground mask; the original guide map is modified according to the predicted foreground mask and predicted again through the codec network; this process is repeated until an accurate foreground mask is obtained, and the foreground to be extracted is then obtained from the accurate foreground mask and the input image;
the codec network comprises: encoder, semantic information fusion module, jump connection module and decoder, wherein: the encoder extracts a multi-scale deep feature map from an input feature map formed by connecting an input image and a guide map on a channel respectively and outputs the multi-scale deep feature map to a semantic information fusion module, extracts a multi-scale shallow feature map and outputs the multi-scale shallow feature map to a decoder through a jump connection module; the semantic information fusion module performs feature fusion and up-sampling according to the multi-scale deep feature map to obtain deep semantic features of contour information containing the foreground; and the decoder performs up-sampling on the deep semantic features and simultaneously fuses with the multi-scale shallow feature map to finally obtain the foreground mask.
2. The matting method based on a codec network and a guide map according to claim 1, characterized in that the semantic information fusion module comprises a cascaded feature pyramid enhancement unit and joint upsampling unit, wherein: the feature pyramid enhancement unit extracts multi-scale features from the backbone network, fuses them and enhances semantic information, and the joint upsampling unit upsamples the enhanced features to obtain the deep semantic features.
3. The codec network and guide map based matting method according to claim 1, characterized in that when the foreground mask predicted by the codec network does not reach the expected effect, the original guide map is modified and predicted again by the codec network, and then judged again and cycled until an accurate matting result is obtained.
CN202111126534.5A 2021-09-26 2021-09-26 Matting method based on codec network and guide map Pending CN113838084A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111126534.5A CN113838084A (en) 2021-09-26 2021-09-26 Matting method based on codec network and guide map

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111126534.5A CN113838084A (en) 2021-09-26 2021-09-26 Matting method based on codec network and guide map

Publications (1)

Publication Number Publication Date
CN113838084A true CN113838084A (en) 2021-12-24

Family

ID=78970099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111126534.5A Pending CN113838084A (en) 2021-09-26 2021-09-26 Matting method based on codec network and guide map

Country Status (1)

Country Link
CN (1) CN113838084A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460770A (en) * 2016-12-13 2018-08-28 华为技术有限公司 Scratch drawing method and device
CN111223106A (en) * 2019-10-28 2020-06-02 稿定(厦门)科技有限公司 Full-automatic portrait mask matting method and system
CN111815649A (en) * 2020-06-30 2020-10-23 清华大学深圳国际研究生院 Image matting method and computer readable storage medium
CN112862838A (en) * 2021-02-04 2021-05-28 中国科学技术大学 Natural image matting method based on real-time click interaction of user
CN113012169A (en) * 2021-03-22 2021-06-22 深圳市人工智能与机器人研究院 Full-automatic cutout method based on non-local attention mechanism

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460770A (en) * 2016-12-13 2018-08-28 华为技术有限公司 Scratch drawing method and device
CN111223106A (en) * 2019-10-28 2020-06-02 稿定(厦门)科技有限公司 Full-automatic portrait mask matting method and system
CN111815649A (en) * 2020-06-30 2020-10-23 清华大学深圳国际研究生院 Image matting method and computer readable storage medium
CN112862838A (en) * 2021-02-04 2021-05-28 中国科学技术大学 Natural image matting method based on real-time click interaction of user
CN113012169A (en) * 2021-03-22 2021-06-22 深圳市人工智能与机器人研究院 Full-automatic cutout method based on non-local attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QIHANG YU: "Mask Guided Matting via Progressive Refinement Network", 《ARXIV:2012.06722V2 [CS.CV]》, pages 1 - 10 *
YU QIAO: "Attention-Guided Hierarchical Structure Aggregation for Image Matting", 《2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》, pages 13673 - 13682 *

Similar Documents

Publication Publication Date Title
Liu et al. Connecting image denoising and high-level vision tasks via deep learning
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
CN109712165B (en) Similar foreground image set segmentation method based on convolutional neural network
CN111767979A (en) Neural network training method, image processing method, and image processing apparatus
CN111861886B (en) Image super-resolution reconstruction method based on multi-scale feedback network
CN111986075B (en) Style migration method for target edge clarification
CN112215755B (en) Image super-resolution reconstruction method based on back projection attention network
CN102915527A (en) Face image super-resolution reconstruction method based on morphological component analysis
Ma et al. Meta PID attention network for flexible and efficient real-world noisy image denoising
CN113222875B (en) Image harmonious synthesis method based on color constancy
CN113240683B (en) Attention mechanism-based lightweight semantic segmentation model construction method
CN113870124B (en) Weak supervision-based double-network mutual excitation learning shadow removing method
CN113538608B (en) Controllable figure image generation method based on generation countermeasure network
CN113052755A (en) High-resolution image intelligent matting method based on deep learning
CN113052775B (en) Image shadow removing method and device
CN112686816A (en) Image completion method based on content attention mechanism and mask code prior
CN115471665A (en) Matting method and device based on tri-segmentation visual Transformer semantic information decoder
Yu et al. Semantic-driven face hallucination based on residual network
CN113436224B (en) Intelligent image clipping method and device based on explicit composition rule modeling
CN112686817B (en) Image completion method based on uncertainty estimation
CN113838084A (en) Matting method based on codec network and guide map
Yu et al. MagConv: Mask-guided convolution for image inpainting
CN111260585A (en) Image recovery method based on similar convex set projection algorithm
CN116188273A (en) Uncertainty-oriented bimodal separable image super-resolution method
CN115937429A (en) Fine-grained 3D face reconstruction method based on single image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination