CN111161286B - Interactive natural image matting method

Info

Publication number
CN111161286B
CN111161286B (application CN202010000472.2A)
Authority
CN
China
Prior art keywords
image, super pixel, information, user
Prior art date
2020-01-02
Legal status
Active
Application number
CN202010000472.2A
Other languages
Chinese (zh)
Other versions
CN111161286A (en)
Inventor
乔羽 (Qiao Yu)
杨鑫 (Yang Xin)
魏小鹏 (Wei Xiaopeng)
张强 (Zhang Qiang)
尹宝才 (Yin Baocai)
Current Assignee
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date
2020-01-02
Filing date
2020-01-02
Publication date
2023-06-20
Application filed by Dalian University of Technology
Priority to CN202010000472.2A
Publication of CN111161286A
Application granted
Publication of CN111161286B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06T 7/11 - Image analysis; segmentation: region-based segmentation
    • G06N 3/045 - Neural networks: combinations of networks
    • G06N 3/047 - Neural networks: probabilistic or stochastic networks
    • G06N 3/08 - Neural networks: learning methods
    • G06T 7/187 - Segmentation; edge detection involving region growing, region merging or connected component labelling
    • G06T 2207/10004 - Image acquisition modality: still image; photographic image
    • G06T 2207/10024 - Image acquisition modality: color image
    • G06T 2207/20081 - Special algorithmic details: training; learning
    • G06T 2207/20084 - Special algorithmic details: artificial neural networks [ANN]
    • Y02T 10/40 - Engine management systems (climate-change mitigation tagging)


Abstract

The invention belongs to the technical field of computer vision and provides an interactive natural image matting method that realizes a user-friendly, simple interactive matting framework divided into five stages: super-pixel division, information-region selection, user scribbling, Markov propagation, and CNN propagation. The framework generates accurate image mattes efficiently through simple user interaction, and fine image details can be accurately predicted. The combination of Markov chains and deep learning effectively propagates and diffuses labels over both local neighborhoods and the full image, extracting the maximum value from limited user interaction. The invention obtains a more accurate matte, saves a great amount of computation time compared with matting algorithms that depend on a trimap, and achieves an effective balance between matting accuracy and cost.

Description

Interactive natural image matting method
Technical Field
The invention belongs to the technical field of computer vision and relates to an interactive natural image matting method based on Markov chains and deep learning.
Background
In recent years, with the continuous development of the internet and mobile devices, images have become an integral part of human life, and the corresponding image-processing technologies have developed in step with public demand. Image classification, semantic segmentation, and related techniques are hot topics in image processing, but as application standards rise in industry and daily life (film production, live streaming, portrait beautification, and so on), traditional image segmentation can no longer meet the demand for foreground refinement, and image matting has accordingly attracted increasing attention. Compared with segmentation, matting not only requires separating the objects in an image but also requires that fine details, such as human hair, animal fur, and plant branches, be accurately preserved in the result. Such refined segmentation better satisfies the pursuit of high-quality imagery and is both attractive and challenging for industry and research alike.
Image matting requires pixel-accurate separation, and its starting point is the most basic image compositing equation:
I_z = α·F_z + (1 − α)·B_z,  α ∈ [0, 1]    (1)
where z denotes a pixel in the image, I_z the value actually observed at z, F_z and B_z the foreground and background values at z, and α the mixing proportion of foreground to background, which can also be read as the opacity of the foreground over the background. The compositing equation gives the pixel-level interpretation of basic image formation: every pixel is a mixture of foreground and background, with α = 1 meaning z is pure foreground (completely opaque) and α = 0 meaning z is pure background (completely transparent). Pixels with α ∈ (0, 1) lie in the border zone between foreground and background (human hair, fine plant branches, semi-transparent areas, and the like); these mixed pixels form the transition region that matting must solve for. The gray-scale image formed by the α values is the target of the matting technique: the alpha matte.
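As a concrete illustration of equation (1), the following sketch composites a foreground over a background with a per-pixel alpha matte. The function and array names are illustrative assumptions, not part of the patent.

```python
# A minimal sketch of the compositing equation (1): I_z = alpha*F_z + (1 - alpha)*B_z.
import numpy as np

def composite(foreground: np.ndarray, background: np.ndarray,
              alpha: np.ndarray) -> np.ndarray:
    """Blend foreground over background with a per-pixel alpha matte.

    foreground, background: float arrays of shape (H, W, 3) with values in [0, 1].
    alpha: float array of shape (H, W) with values in [0, 1].
    """
    a = alpha[..., np.newaxis]  # broadcast alpha over the 3 color channels
    return a * foreground + (1.0 - a) * background
```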
Equation (1) is under-constrained: for an ordinary RGB three-channel picture there are 7 unknowns per pixel (three for F, three for B, and α), so existing matting algorithms depend on an additional auxiliary image combined with the input RGB picture to compute the alpha matte. The auxiliary images fall into two main types: trimaps and stroke (scribble) images. A trimap divides the original color picture into white, black, and gray parts, where the white area marks the foreground, the black area the background, and the gray area the transition region between them. A stroke image simply draws several strokes over the picture to designate foreground, background, and transition regions. Both kinds of auxiliary input classify some of the pixels in the image; the problem a matting algorithm solves is how to compute the matte of the whole image from the information they provide.
Although a trimap provides complete annotation, producing it costs much time and effort, so its feasibility in practical applications is poor. A stroke image is user-friendly: the matte can be computed from a few drawn strokes. However, the number of strokes plays a decisive role in matte quality, and their positions must satisfy assumptions about boundary smoothness or the prior distributions of the algorithm. In recent years many deep-learning methods have appeared, but most rely on trimaps to guarantee the quality of the predicted matte, and because they are usually trained on synthetic datasets, their effect on real-world pictures is poor.
Disclosure of Invention
Aiming at the defects of the existing methods, the invention provides an interactive image matting framework based on cascaded propagation. The framework autonomously selects several regions, called information regions, that satisfy the prior distribution or edge-smoothness assumptions of the matting algorithm; the user then simply scribbles on these regions to mark foreground, background, and transition areas. The category labels given by the user are first propagated among neighborhoods through a Markov chain, the category transition probabilities among superpixels are updated, and propagation over the whole picture is then carried out by a convolutional neural network (CNN), producing the final refined alpha matte.
The technical scheme of the invention is as follows:
the interactive natural image matting method comprehensively considers the accuracy and time requirements of mask calculation, realizes the balance between the labor of a user and the mask precision, adopts a model comprising five stages, and has the following specific technical scheme:
(1) Super-pixel partitioning stage
The super-pixel division stage divides the input image into superpixels: pixel blocks with coherent color and texture features. Subsequent information-region computation and user-label propagation operate on superpixels as the basic unit, which markedly reduces computation time while exploiting the similarity between superpixels to improve propagation efficiency;
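The patent does not name a specific superpixel algorithm, so the sketch below assumes SLIC from scikit-image purely for illustration; the input path and segment count are likewise hypothetical. It also computes the per-superpixel mean color, the kind of per-block statistic the later stages operate on.

```python
import numpy as np
from skimage import io
from skimage.segmentation import slic

image = io.imread("input.png")[:, :, :3]  # hypothetical input file
labels = slic(image, n_segments=800, compactness=10, start_label=0)

# Per-superpixel mean color (one row per superpixel).
n_sp = labels.max() + 1
counts = np.bincount(labels.ravel(), minlength=n_sp)
mean_color = np.stack([
    np.bincount(labels.ravel(), weights=image[..., c].ravel(), minlength=n_sp) / counts
    for c in range(3)
], axis=1)
```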
(2) Selection phase of information area
Traditional stroke-based methods place strict requirements on stroke positions: the strokes must satisfy edge-smoothness assumptions or the prior distributions of the matting algorithm, which demands rich matting expertise from the user. To improve user-friendliness during scribbling, the invention provides an automatic region-selection algorithm; the flow is shown in Fig. 1. The input image is divided evenly into 16 regions, and the information amount of each region is computed from the superpixel information inside it, jointly considering the color, texture, label entropy, and object-boundary information of the region:
Info = I + J + E + S    (2)
wherein: I. j, E and S represent the similarity between the region and other regions, the diversity of the inner superpixels, the existing label entropy, and the object boundary information contained inside, respectively; the variables in equation (2) are all calculated in units of superpixels, where I and J refer to the color and the cultural information of the superpixels, and are specifically defined as follows:
[Equation (3): rendered as an image in the original; it defines I from the color-mean, color-histogram, and texture-histogram similarities of superpixels]
where cm_i, ch_i, and th_i are respectively the color mean, color histogram, and texture histogram of superpixel i, θ prevents division by zero, and λ_1, λ_2, and λ_3 are balance coefficients, taken as 0.4, 0.35, and 0.25 in practice;
[Equation (4): rendered as an image in the original; it defines J, the negated similarity among the superpixels inside the region]
j, calculating the whole negative number; i considers the similarity between the region and other regions, and the region with high similarity with other regions can express the integral characteristics of the image; j, considering the difference between the super pixels in the region, wherein the region with larger internal difference is more likely to be positioned in the transition region between the foreground and the background, and is more significant for the marking process of the user;
the label entropy is defined as follows:
[Equation (5): rendered as an image in the original; it defines the label entropy E over the class probabilities pb_i, pu_i, and pf_i]
where pb_i, pu_i, and pf_i respectively denote the probability that superpixel i belongs to the background, the transition region, or the foreground. Superpixels already marked, or clearly classifiable after propagation, obviously need no further marking, and their E value is correspondingly low; regions the user has not marked and that propagation cannot resolve retain large initial class probabilities, so under the label-entropy term they are more likely to be selected for the next round of user scribbling;
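Since equation (5) is reproduced only as an image in the original, the sketch below assumes a standard Shannon entropy over the three class probabilities as a plausible reading of the label-entropy term.

```python
import numpy as np

def label_entropy(pb: np.ndarray, pu: np.ndarray, pf: np.ndarray) -> np.ndarray:
    """Per-superpixel entropy; high where the class is still uncertain."""
    p = np.stack([pb, pu, pf], axis=-1)
    p = np.clip(p, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)
```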
calculation of object boundaries refers to corresponding values in the boundary map:
[Equation (6): rendered as an image in the original; it defines e_i by summarizing the boundary-map values em_k over the pixel set ψ_i of superpixel i]
where e_i is the boundary-map value of superpixel i, summarized over its internal pixel set ψ_i; em_k denotes the boundary-map value of pixel k in ψ_i; and δ and ε are balance coefficients computed from the number of pixels in ψ_i. A reference boundary map is shown in Fig. 2(b) (values lie in [0, 1], and white marks pixels likely to sit on an object boundary).
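The sketch below assembles a per-region score in the spirit of equation (2). Because equations (3), (4), and (6) appear only as images in the original, the similarity, diversity, and boundary terms are plausible stand-ins (histogram intersection, mean pairwise color distance, and the mean boundary-map value), not the patented formulas; the feature-dictionary layout is likewise an assumption.

```python
import numpy as np

def region_info(region_sp, all_sp, entropy, boundary_map, region_mask):
    """Score one of the 16 regions as Info = I + J + E + S.

    region_sp / all_sp: lists of dicts with keys "id", "color", "hist"
    (assumed superpixel features); entropy: per-superpixel label entropy;
    boundary_map: (H, W) array in [0, 1]; region_mask: boolean (H, W) array.
    """
    # I: how well the region's superpixels resemble the rest of the image
    I = np.mean([np.minimum(s["hist"], t["hist"]).sum()
                 for s in region_sp for t in all_sp])
    # J: internal diversity (a negated similarity among the region's superpixels)
    J = np.mean([np.linalg.norm(s["color"] - t["color"])
                 for s in region_sp for t in region_sp])
    # E: summed label entropy of the region's superpixels
    E = float(sum(entropy[s["id"]] for s in region_sp))
    # S: object-boundary evidence inside the region
    S = float(boundary_map[region_mask].mean())
    return I + J + E + S
```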
(3) User line drawing stage
According to the information-amount formula, each iteration computes Info for every region and selects the region with the largest information amount for user scribbling. The user only needs to mark the foreground, transition region, and background with red, green, and blue strokes respectively; no special skill is required, and one or two strokes per category suffice, which is quite friendly to the user. A sketch of decoding such strokes into superpixel labels follows below.
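A minimal sketch of the scribble-decoding step, assuming pure red, green, and blue strokes rasterized into an overlay image; any real UI would produce a similar mask. The class ids (0 background, 1 transition, 2 foreground) are a convention chosen here, not fixed by the patent.

```python
import numpy as np

BLUE, GREEN, RED = (0, 0, 255), (0, 255, 0), (255, 0, 0)

def scribbles_to_labels(scribble: np.ndarray, sp_labels: np.ndarray) -> np.ndarray:
    """scribble: (H, W, 3) uint8 stroke overlay; sp_labels: (H, W) superpixel ids.
    Returns one label per superpixel: 0 background (blue), 1 transition (green),
    2 foreground (red), -1 unmarked."""
    out = -np.ones(sp_labels.max() + 1, dtype=int)
    for cls, color in enumerate([BLUE, GREEN, RED]):
        mask = np.all(scribble == color, axis=-1)  # pixels painted in this color
        for sp in np.unique(sp_labels[mask]):
            out[sp] = cls
    return out
```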
(4) Markov propagation
After the user scribbles, the marking information is propagated over the whole image to the greatest extent possible, enabling refined matte computation; diffusion within the neighborhood relies on a Markov chain. Each superpixel is treated as a node of the Markov chain, and a probability transition matrix is constructed from the color and texture similarity between superpixels; superpixels carrying definite label information are treated as absorbing nodes. After each round of scribbling the matrix is updated, and the latest transition probabilities pb_i, pu_i, and pf_i from every node to the absorbing nodes of each class are computed; the final trimap is obtained from the final probability transition matrix. Information-region selection, user scribbling, and Markov propagation iterate (Fig. 1), propagating the marking information within neighborhoods to the maximum extent. To balance user labor against matte quality, the number of iterations is empirically set to 6; that is, only 6 of the 16 divided regions are selected for user scribbling;
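The following sketch shows the standard absorbing-Markov-chain computation this stage describes: user-marked superpixels absorb, and every unlabeled superpixel receives the probability of being absorbed by each class. The affinity matrix W is assumed to be built from color and texture similarity; the patent's exact affinity definition is not reproduced here.

```python
import numpy as np

def absorb_probabilities(W: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """W: (n, n) nonnegative superpixel affinity matrix.
    labels: (n,) ints, -1 unlabeled, 0/1/2 = background/transition/foreground.
    Returns (n, 3) class probabilities (pb_i, pu_i, pf_i) per superpixel."""
    n = len(labels)
    P = W / W.sum(axis=1, keepdims=True)            # row-stochastic transitions
    transient = np.flatnonzero(labels < 0)          # unlabeled nodes
    absorbing = np.flatnonzero(labels >= 0)         # user-marked (absorbing) nodes
    Q = P[np.ix_(transient, transient)]             # transient -> transient
    R = P[np.ix_(transient, absorbing)]             # transient -> absorbing
    N = np.linalg.inv(np.eye(len(transient)) - Q)   # fundamental matrix
    B = N @ R                                       # absorption probabilities
    probs = np.zeros((n, 3))
    probs[absorbing, labels[absorbing]] = 1.0       # marked nodes are certain
    for j, a in enumerate(absorbing):               # pool absorbers by class
        probs[transient, labels[a]] += B[:, j]
    return probs
```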
(5) Convolutional Neural Network (CNN) propagation
Markov propagation ensures efficient diffusion of the user's stroke labels within neighborhoods, whereas CNN propagation predicts label information over the whole image; the network structure is shown in Fig. 3. The bounding rectangle of each superpixel is taken as its image patch; the patches of user-marked superpixels form the training set used to train the network model, the trained model then predicts the class probabilities of the remaining superpixels, and the probability transition matrix is updated accordingly. The CNN is similar to a classical classification network, with foreground, background, and transition region as the three class labels and a final softmax activation. The whole network is trained for 20 epochs with an exponentially decaying learning rate and a cross-entropy loss. A threshold of 0.65 converts the probability matrix into a trimap, and the final alpha matte is predicted by an existing trimap-dependent algorithm.
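A hedged PyTorch sketch of this stage: Fig. 3's exact layer configuration is not reproduced, so a generic small classifier stands in, but the stated training recipe is followed (three classes with softmax, cross-entropy loss, exponentially decaying learning rate, 20 epochs, 0.65 threshold). The dummy data at the bottom stands in for the real pipeline of superpixel bounding-box crops.

```python
import torch
import torch.nn as nn

class SuperpixelClassifier(nn.Module):
    """Small stand-in classifier; the patented network in Fig. 3 may differ."""
    def __init__(self, n_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, n_classes)  # logits; softmax applied at inference

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

# Dummy stand-ins for the real data (bounding-box crops of superpixels):
train_loader = [(torch.randn(8, 3, 32, 32), torch.randint(0, 3, (8,)))]
unlabeled_patches = torch.randn(16, 3, 32, 32)

model = SuperpixelClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
criterion = nn.CrossEntropyLoss()  # cross entropy over softmax logits

for epoch in range(20):  # 20 epochs, as stated in the text
    for patches, targets in train_loader:  # user-marked superpixel crops
        optimizer.zero_grad()
        loss = criterion(model(patches), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()  # exponentially decaying learning rate

# Class probabilities for the remaining superpixels; entries above the 0.65
# threshold become definite foreground/background in the trimap.
with torch.no_grad():
    probs = torch.softmax(model(unlabeled_patches), dim=1)
```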
The invention has the following beneficial effects. Compared with existing image matting methods, the proposed method achieves accurate matte computation while remaining user-friendly: the user obtains a satisfactory matting result by drawing a few simple strokes marking foreground, background, and unknown regions on automatically selected regions. The proposed information-region selection effectively picks out the key areas of an image and is of considerable reference value for computer vision and image processing. Likewise, the propagation scheme combining Markov chains with deep learning offers an instructive way to diffuse limited labels over a whole image.
Drawings
FIG. 1 is the iterative flow chart of information-region selection, user scribbling, and Markov propagation; the overall process iterates 6 times, achieving in-neighborhood propagation of the user's labels.
FIG. 2a is the original input image; FIG. 2b is the object boundary map.
FIG. 3 is the network structure adopted in CNN propagation; the specific layers and hyper-parameters such as convolution kernels are configured as shown in the figure, and the convolution, pooling, and other layers are all standard network-layer structures.
Fig. 4 is a flow chart of the overall framework.
Detailed Description
The present invention is described in further detail with reference to the following embodiments. The invention is an important technical improvement over existing image matting methods: compared with the two existing auxiliary inputs, trimaps and stroke images, it can reach matte accuracy close to that of a finely drawn trimap while saving a great deal of computation time and user work, achieving an ideal balance between computational cost and result quality. The implementation effect is explained below in terms of time and accuracy.
(1) Analysis of user labor and time costs in implementation
The information-region selection and the cascaded Markov/CNN propagation involved in the invention are fully automatic. The information regions effectively represent the characteristics of the whole image, and their automatic selection replaces the professional expertise that scribbling would otherwise demand, so the user only needs to draw a few strokes on the information regions. Comprehensive experiments with the traditional stroke-drawing approach show that obtaining even a rough matte generally requires stroke information covering about 8% of the whole image, whereas the proposed framework only requires drawing on 6 information regions, with the stroke area averaging only 3.6% of the image. The invention thus effectively reduces the user's interaction workload and, through automatic region selection, removes the expertise that scribbling demands of the user.
The invention generates a high-quality trimap from limited user scribbles and embeds it into existing trimap-dependent matting algorithms to realize a refined matte. Compared with a fine trimap produced by manual annotation, this saves a large amount of time, reducing annotation from a few hours to about one minute; it greatly improves the practicality of matting algorithms in real scenarios and is of great reference significance for industrial and consumer applications.
(2) Accuracy analysis in practical implementation
Compared with existing matting auxiliary images, the method provides limited but effective interactive labels in a user-friendly way and guides matte computation over the whole image. Meanwhile, the invention proposes an efficient propagation framework that organically combines a Markov chain with a convolutional neural network: the Markov chain is responsible for maximally propagating the user's marks within spatial neighborhoods, while the convolutional neural network diffuses the superpixel labels over the whole image by extracting high-dimensional semantic information from the superpixels and comparing their semantic similarity. This combination of traditional machine learning and deep learning fuses superpixel affinity with semantic similarity and makes a decisive contribution to the matte.
The proposed information-region selection is also of reference significance for future work in image processing and computer vision. Images are one of the main information carriers of today's society and constantly affect our lives. Research on saliency detection, object detection, and related topics offers an important lesson: although an image is a representation of per-pixel color values, certain key areas within it can determine the image's attributes, classification, and high-dimensional semantics. The same lesson underlies the information-region selection of this invention, where automated region selection replaces the demands that users or other computer vision techniques would place on different image regions. The proposed framework automatically selects information regions containing representative color, texture, label, and object-boundary information that reflect attributes and characteristics of the whole image. Many image processing and computer vision tasks can likewise focus on such information regions, dividing and marking them to summarize and predict the whole image, thereby saving overall workload and running time to the greatest extent while guaranteeing task quality. The proposed information-region selection can serve as an important reference for such work.

Claims (1)

1. The interactive natural image matting method is characterized in that the adopted model comprises five stages, and the specific steps are as follows:
(1) Super-pixel partitioning stage
The super-pixel division stage divides the input image into superpixels; superpixels are pixel blocks that represent color and texture features;
(2) Selection phase of information area
Dividing the input image evenly into 16 regions and computing the information amount Info of each region from the superpixel information inside it, jointly considering the color, texture, label entropy, and object-boundary information of each region:
Info = I + J + E + S    (2)
wherein: I. j, E and S represent the similarity between the region and other regions, the diversity of the inner superpixels, the existing label entropy, and the object boundary information contained inside, respectively; the variables in equation (2) are all calculated in units of superpixels, where I and J refer to the color and the cultural information of the superpixels, and are specifically defined as follows:
[Equation (3): rendered as an image in the original]
wherein cm_i, ch_i, and th_i are respectively the color mean, color histogram, and texture histogram of superpixel i, θ prevents division by zero, and λ_1, λ_2, and λ_3 are balance coefficients, taken as 0.4, 0.35, and 0.25 in actual operation;
[Equation (4): rendered as an image in the original]
j, calculating the whole negative number; i considers the similarity between the region and other regions, and the region with high similarity with other regions can express the integral characteristics of the image; j, considering the difference between the super pixels in the region, wherein the region with larger internal difference is more likely to be positioned in the transition region between the foreground and the background, and is more significant for the marking process of the user;
the label entropy is defined as follows:
[Equation (5): rendered as an image in the original]
wherein pb_i, pu_i, and pf_i respectively denote the probability that superpixel i belongs to the background, the transition region, or the foreground;
calculation of object boundaries refers to corresponding values in the boundary map:
[Equation (6): rendered as an image in the original]
wherein e_i is the boundary-map value of superpixel i, summarized over its internal pixel set ψ_i; em_k denotes the boundary-map value of pixel k in ψ_i; δ and ε are balance coefficients computed from the number of pixels in ψ_i; the values lie in [0, 1], and white represents a pixel that may be located on an object boundary;
(3) User line drawing stage
according to the information-amount formula, computing the information amount Info of each region in every iteration and selecting the region with the largest information amount for user scribbling; the user only needs to mark the foreground, transition region, and background with red, green, and blue respectively, and each category only needs one or two strokes;
(4) Markov propagation
after the user scribbles, propagating the marking information over the whole image to the greatest extent to enable refined matte computation, with diffusion within the neighborhood relying on a Markov chain; each superpixel is treated as a node of the Markov chain, and a probability transition matrix is constructed from the color and texture similarity between superpixels, with superpixels carrying definite label information treated as absorbing nodes; the probability transition matrix is updated after each round of scribbling, and the latest transition probabilities pb_i, pu_i, and pf_i from every node to the absorbing nodes of each class are computed; the final trimap is obtained from the final probability transition matrix; information-region selection, user scribbling, and Markov propagation are performed iteratively, propagating the marking information within neighborhoods to the maximum extent;
(5) Convolutional neural network propagation
taking the bounding rectangle of each superpixel, using the user-marked superpixels as the training set to train a network model, then predicting the class probabilities of the remaining superpixels with the trained model, and finally updating the probability transition matrix; foreground, background, and transition region serve as the three class labels, with a final softmax activation; the whole network is trained for 20 epochs with an exponentially decaying learning rate and a cross-entropy loss; a threshold of 0.65 is set to compute a trimap from the probability transition matrix, and the final alpha matte is predicted.
CN202010000472.2A (filed 2020-01-02, priority 2020-01-02) Interactive natural image matting method; status: Active; granted as CN111161286B (en)

Priority Applications / Applications Claiming Priority (1)

Application Number: CN202010000472.2A (CN111161286B); Priority Date: 2020-01-02; Filing Date: 2020-01-02; Title: Interactive natural image matting method

Publications (2)

Publication Number: CN111161286A (en); Publication Date: 2020-05-15
Publication Number: CN111161286B (en, granted); Publication Date: 2023-06-20

Family

ID: 70560872

Family Applications (1)

Application Number: CN202010000472.2A; Status: Active; Publication: CN111161286B (en); Priority Date: 2020-01-02; Filing Date: 2020-01-02; Title: Interactive natural image matting method

Country Status (1)

Country: CN; Publication: CN111161286B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990331A (en) * 2021-03-26 2021-06-18 共达地创新技术(深圳)有限公司 Image processing method, electronic device, and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10402690B2 (en) * 2016-11-07 2019-09-03 Nec Corporation System and method for learning random-walk label propagation for weakly-supervised semantic segmentation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107452010A * 2017-07-31 2017-12-08 Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences Automatic image matting algorithm and device
CN108961279A * 2018-06-28 2018-12-07 OPPO (Chongqing) Intelligent Technology Co., Ltd. Image processing method, device and mobile terminal
CN109377498A * 2018-08-31 2019-02-22 Dalian University of Technology Interactive matting method based on recurrent neural network
CN109712145A * 2018-11-28 2019-05-03 Shandong Normal University Image matting method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on digital image matting methods fusing multi-cue information; Zhang Chao; Du Yuying; Han Cheng; Bai Ye; Computer Engineering and Applications (17); full text *

Also Published As

Publication Number: CN111161286A (en); Publication Date: 2020-05-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant