CN113361530A - Image semantic accurate segmentation and optimization method using interaction means - Google Patents

Info

Publication number
CN113361530A
Authority
CN
China
Prior art keywords
image
segmentation
pixel
semantic
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010149919.2A
Other languages
Chinese (zh)
Inventor
狄休
张丽清
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202010149919.2A
Publication of CN113361530A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

A method for accurate semantic segmentation and optimization of images using interaction: the user marks delimiting points as interactive input; a deep convolutional network extracts semantic features to produce a preliminary segmentation of the image to be segmented; a conditional random field model of the image is then constructed so that its bottom-layer features refine the preliminary result; finally, the accurate segmentation result is further optimized through interaction. The invention uses delimiting points as the interactive input, which is simple and convenient to operate and improves the segmentation effect; a fully convolutional neural network extracts image semantics, giving the segmentation result semantic completeness; and the user may guide corrections through manual intervention, after which the bottom-layer image features assist in optimizing the result.

Description

Image semantic accurate segmentation and optimization method using interaction means
Technical Field
The invention relates to a technology in the field of image processing, in particular to a convenient and fast method for accurate image segmentation and optimization that uses delimiting points as the interaction means and fuses bottom-layer image features with semantic information.
Background
Image segmentation divides an image into specific regions with distinctive properties and extracts targets of interest; it is a key technology in digital image processing. Common interactive input modes fall into three types: box selection, scribbling and clicking. Box selection provides the richest information but cannot simultaneously remove redundancy: the user specifies the region containing the target object with a rectangle, and depending on the object's shape and category the box may also enclose large amounts of background, degrading the segmentation. To reduce such irrelevant information, the user can instead scribble over a specific area to mark it as reliable foreground, which supplies a large foreground region while guaranteeing the accuracy of the information; however, the size of the scribbled area strongly affects the segmentation result, the interaction is tedious, and an ideal effect is hard to reach when the object has many parts or rich texture. Clicking provides the least accurate information but is the most convenient interaction, generally used to specify segmentation boundaries or further refine a result. Each of the three modes has advantages and drawbacks, and a single input type often cannot meet practical needs, so common interactive segmentation methods combine several of them: GrabCut uses box selection to roughly locate the object and lets the user refine the result by scribbling foreground and background; LazySnapping uses scribbles to specify foreground and background pixels and click input to edit the boundary.
The technical problems urgently to be solved by existing image segmentation methods are: 1) traditional segmentation methods lack high-level semantic information; 2) a convenient means of manual intervention is needed when the segmentation is not good enough; 3) existing interaction means are cumbersome and inefficient; 4) computational efficiency must be improved and processing time shortened.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a method for accurate semantic segmentation and optimization of images using interaction: image semantic information is extracted by a convolutional neural network structure, and semantic features are fully combined with image features to realize segmentation; target delimiting points replace the traditional interaction modes as the user input, improving the segmentation effect while simplifying the interaction; and the user is allowed to further modify the segmentation result interactively, achieving an ideal segmentation effect on a wide range of objects.
The invention is realized by the following technical scheme:
the invention relates to an image semantic accurate segmentation and optimization method using an interaction means, which realizes interaction by inputting a mark delimiting point by a user, extracts semantic features by using a deep convolutional network to realize preliminary segmentation of an image to be segmented, optimizes a result of the preliminary segmentation by using bottom layer features of the image to be segmented by constructing a conditional random field model of the image to be segmented, and further optimizes an accurate segmentation result by using an interaction means.
The preliminary segmentation, i.e. image segmentation at the semantic level, is realized by a convolutional neural network with a ResNet101+DeepLabV2 structure.
The conditional random field model is a graph model built with a fully connected conditional random field; the position and color correlation between pixels serves as the bottom-layer feature for optimizing the preliminary segmentation.
The invention also relates to a system implementing the method, comprising a user interaction unit, a preprocessing unit, a semantic segmentation unit and a boundary optimization unit, wherein: the user interaction unit is connected to the preprocessing unit and the boundary optimization unit and transmits user interaction information; the preprocessing unit is connected to the semantic segmentation unit and transmits the processed image and interaction information; the semantic segmentation unit is connected to the boundary optimization unit and transmits the semantic segmentation of the image; and the boundary optimization unit receives both the semantic segmentation and the user interaction information, computes the optimized result and feeds it back.
The boundary optimization unit models the bottom-layer features with the conditional random field model to assist in optimizing the semantic segmentation, enhancing the robustness of the result. The user can notify the system by clicking a mis-segmented region; the system updates the conditional random field model, re-infers the segmentation, and gives the user immediate feedback.
Technical effects
The invention as a whole solves the technical problems of traditional methods: missing high-level semantic information, poor user interaction, and lack of manual intervention and correction. Compared with the prior art, the technical effects of the invention are:
1) the delimiting points are used as interactive input, so that the operation is simple and convenient, and the segmentation effect is improved.
2) And extracting image semantics by using a full convolution neural network, wherein the segmentation result has semantic integrity.
3) The user is allowed to guide modification through manual intervention, and then the image bottom-layer features are used for assisting in optimizing the result.
Drawings
FIG. 1 is a diagram comparing delimiting-point and box-selection interaction;
FIG. 2 is a flow chart of the present invention;
FIG. 3 is a diagram illustrating comparison of the optimization effect of conditional random fields;
FIG. 4 is a schematic view showing the effect of manual intervention modification;
FIGS. 5-9 are graphs presenting the experimental results.
Detailed Description
As shown in fig. 2, this embodiment relates to a method for accurate semantic segmentation and optimization of images using interaction. Four delimiting points input by the user are acquired; a sub-image containing the target region is cropped from the original image and preprocessed, then fed into a trained convolutional neural network for semantic segmentation to obtain a preliminary result. A fully connected conditional random field is then built from the position and color of the pixels in the sub-image, and the bottom-layer features optimize the segmentation, which is displayed for the user to inspect. The user clicks mis-segmented regions to determine the foreground/background attribution of the corresponding areas; the algorithm updates and re-solves the conditional random field model, correcting the regions the user is unsatisfied with, until the interaction terminates and the segmentation result is saved.
The embodiment specifically comprises the following steps:
step 0) user interaction: defining the pixel points closest to the four boundaries of the upper, lower, left and right in the target object image as the delimitation points of the object, and marking the four delimitation points of the object by a user in a clicking mode.
Step 1) image preprocessing: the original image is large and most of it is irrelevant to the computation. The algorithm computes the bounding rectangle of the four delimiting points input by the user, fills the rectangle out to a square and expands it by 50 pixels to obtain the crop box; if the crop box exceeds the image boundary, zero padding is applied. The resulting square crop is resized to 512 by 512 pixels. This step retains neighborhood information, improving robustness, while reducing the amount of computation and improving execution efficiency. Centered on each of the four delimiting points, responses are computed with a normal distribution function and superposed, converting the spatially discrete single-point inputs into a continuous response map; the response map and the cropped 3-channel image are combined into a 4-channel image as the input of the next step.
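The cropping and response-map construction above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the plain-tuple point format, and the Gaussian width `sigma` are assumptions.

```python
import numpy as np

def make_crop_box(points, pad=50):
    """Bounding rectangle of the four delimiting points, filled out to a
    square and expanded by `pad` pixels per side. The box may extend past
    the image, in which case the caller zero-pads (as the method states)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    side = max(max(xs) - min(xs), max(ys) - min(ys))   # square side length
    cx = (min(xs) + max(xs)) / 2.0
    cy = (min(ys) + max(ys)) / 2.0
    half = side / 2.0 + pad
    return (int(cx - half), int(cy - half), int(cx + half), int(cy + half))

def response_map(points, h, w, sigma=10.0):
    """Superposed 2-D Gaussian responses centered on the delimiting points,
    forming the fourth input channel of the network."""
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.zeros((h, w), dtype=np.float32)
    for px, py in points:
        r += np.exp(-((xx - px) ** 2 + (yy - py) ** 2) / (2.0 * sigma ** 2))
    return np.clip(r, 0.0, 1.0)
```

In practice the crop would then be resized to 512 by 512 before being stacked with the response map into the 4-channel input.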
Step 2) semantic segmentation by the neural network: a pre-trained neural network receives the four-channel input generated in step 1), extracts the semantic features of the image, performs semantic segmentation guided by the response map, and outputs the segmentation result. Because the network model has strong learning ability, it can be optimized for specific application scenarios and data types, giving good flexibility. The image size may be reduced during processing, so the output is interpolated and enlarged back to the original size. The output probability map gives the likelihood that each point belongs to the foreground; binarizing it yields the segmentation mask.
For the convolutional neural network model, a DeepLabV2 structure is used with ResNet-101 as the feature extraction backbone; this network is widely used in deep learning and machine vision and has a proven feature extraction capability. The dilated (atrous) convolution structure enlarges the receptive field of the convolution kernels without shrinking the image, and a pyramid scene parsing module is then connected, which uses the pyramid structure to analyze image content at different scales and refine the global context information.
The training data can be labeled and prepared by the user. In this implementation two datasets are used for network training: the finely annotated PASCAL dataset and the coarsely annotated COCO dataset, together comprising roughly ninety thousand publicly available images. To simulate user input, four delimiting points are taken for each annotated object, and random perturbation within 10 pixels is added to enhance the robustness of the model. The neural network learns high-level semantic features of the images from this data and fits the segmentation result. The trained model segments a wide range of objects well and is integrated into the system as the semantic segmentation module. The network structure can be replaced and custom-optimized for other tasks.
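The simulation of user input described above (extreme points of the annotation mask, plus up-to-10-pixel jitter) can be sketched as below; the function and argument names are hypothetical, not from the patent.

```python
import numpy as np

def simulate_delimiting_points(mask, jitter=10, rng=None):
    """Take the top/bottom/left/right-most foreground pixels of a label mask
    as the four delimiting points, then jitter each by up to `jitter` pixels
    to mimic imprecise user clicks (training-time augmentation)."""
    rng = np.random.default_rng(rng)
    ys, xs = np.nonzero(mask)
    pts = [
        (xs[np.argmin(ys)], ys.min()),   # topmost
        (xs[np.argmax(ys)], ys.max()),   # bottommost
        (xs.min(), ys[np.argmin(xs)]),   # leftmost
        (xs.max(), ys[np.argmax(xs)]),   # rightmost
    ]
    h, w = mask.shape
    out = []
    for x, y in pts:
        dx, dy = rng.integers(-jitter, jitter + 1, size=2)
        out.append((int(np.clip(x + dx, 0, w - 1)),
                    int(np.clip(y + dy, 0, h - 1))))
    return out
```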
The neural network produces a good semantic segmentation result but performs poorly on edge details. The output of this step is also affected to some extent by the accuracy of the user's interactive input: some deviation of the delimiting points may cause parts of the result to be misjudged.
Step 3) boundary optimization by the conditional random field: enlarging and restoring the semantic segmentation output of the neural network can produce unsatisfactory edges, and the user cannot always click the object's delimiting points precisely. To enhance the robustness of the algorithm, the segmentation result is further optimized using the bottom-layer features of the image: a fully connected conditional random field models the correlation between all pixels of the sub-image, and redistributing the foreground/background class of each pixel optimizes the segmentation boundary.
The method models the bottom-layer features of the image with a conditional random field: the image is treated as a graph model in which each vertex represents a pixel and the edges between vertices represent the dependency between the corresponding pixels. The Gibbs energy of the fully connected conditional random field is:

E(x) = \sum_i \Psi_i(x_i) + \sum_{i<j} \Psi_{ij}(x_i, x_j)

wherein: i, j are pixel indices ranging over the set of all pixels; x_i is the class of pixel i; \Psi_i is the unary potential of pixel i, the initial class probability distribution generated by the preliminary segmentation result of step 2) and converted with a negative logarithm, \Psi_i(x_i) = -\log P(x_i), where P(x_i) is the probability that pixel i is assigned class x_i; \Psi_{ij} is the binary potential, a graph model of the dependency between pixel pairs, comprising:

\Psi_{ij} = g(i,j)\,[x_i \neq x_j],
g(i,j) = w_1 g_1(i,j) + w_2 g_2(i,j) + w_3 g_3(i,j),
g_1(i,j) = \exp\!\left(-\frac{|I_i - I_j|^2}{2\theta_\alpha^2} - \frac{|p_i - p_j|^2}{2\theta_\beta^2}\right),
g_2(i,j) = \exp\!\left(-\frac{|I_i - I_j|^2}{2\theta_\gamma^2}\right),
g_3(i,j) = \exp\!\left(-\frac{|p_i - p_j|^2}{2\theta_\mu^2}\right)

wherein: p_i is the color of pixel i and I_i its coordinates; g(i,j) is the weighted correlation of pixels i and j; [\cdot] is the label penalty function, equal to 1 when the condition is true and 0 otherwise, which can be read as penalizing the assignment of different labels to pixels similar in position or color; g_1 is the color correlation of neighboring pixels, g_2 the position correlation of global pixels, and g_3 the color correlation of global pixels; \theta_{\alpha,\beta,\gamma,\mu} are the correlation parameters, and the parameters w_{1,2,3} adjust the weight proportions of the different correlations.
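A minimal sketch of the weighted correlation g(i, j) under the three kernel forms above; the function and parameter names are assumptions, and the defaults anticipate the experimental parameter settings reported later in the description.

```python
import math

def pairwise_weight(ci, cj, qi, qj,
                    w=(6.0, 7.0, 1.0),
                    theta_a=15.0, theta_b=3.0, theta_g=1.0, theta_m=7.0):
    """Weighted correlation g(i, j) between two pixels: an appearance kernel
    over position and color (theta_a, theta_b), a position-only kernel
    (theta_g), and a color-only kernel (theta_m). ci/cj are RGB colors,
    qi/qj pixel coordinates."""
    d_pos = sum((a - b) ** 2 for a, b in zip(qi, qj))
    d_col = sum((a - b) ** 2 for a, b in zip(ci, cj))
    g1 = math.exp(-d_pos / (2 * theta_a ** 2) - d_col / (2 * theta_b ** 2))
    g2 = math.exp(-d_pos / (2 * theta_g ** 2))
    g3 = math.exp(-d_col / (2 * theta_m ** 2))
    return w[0] * g1 + w[1] * g2 + w[2] * g3
```

For two identical pixels the correlation equals the sum of the weights; it decays as position or color distance grows.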
Preferably, for the constructed conditional random field model, the method uses the inference algorithm of Philipp Krähenbühl and Vladlen Koltun, "Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials", NIPS 2011, to obtain the maximum posterior probability under the constraints, and binarizes with a threshold to obtain the smoothed, optimized segmentation result.
Conditional random field modeling is performed on the cropped sub-image, which concentrates on local image detail and reduces computation time. The probability map obtained from the convolutional neural network provides the unary potentials of the pixels. In the binary potential computation, several parameters balance the proportions of the different pixel correlations in the global relation; in the experiments the method sets w_1 = 6, w_2 = 7, w_3 = 1, θ_α = 15, θ_β = 3, θ_μ = 7, θ_γ = 1. This parameter configuration focuses on color similarity with adjacent areas, optimizing edge details and reducing erroneous segmentation.
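For illustration only, a naive O(N²) mean-field inference over this binary energy can be written as below. It is usable only on tiny crops; the method itself would rely on the efficient Gaussian-filtering inference of Krähenbühl and Koltun cited above. All names are assumptions.

```python
import numpy as np

def dense_crf_mean_field(p_fg, img, n_iter=5, weights=(6.0, 7.0, 1.0),
                         ta=15.0, tb=3.0, tg=1.0, tm=7.0):
    """Naive O(N^2) mean-field inference for the binary fully connected CRF.
    p_fg: (H, W) foreground probability map from the network (unary term).
    img:  (H, W, C) color image supplying the pairwise kernels.
    Returns the binarized (H, W) boolean segmentation mask."""
    h, w = p_fg.shape
    n = h * w
    ys, xs = np.mgrid[0:h, 0:w]
    pos = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    col = img.reshape(n, -1).astype(np.float64)
    d_pos = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)
    d_col = ((col[:, None, :] - col[None, :, :]) ** 2).sum(-1)
    # g1: joint position/color kernel; g2: position only; g3: color only
    g = (weights[0] * np.exp(-d_pos / (2 * ta**2) - d_col / (2 * tb**2))
         + weights[1] * np.exp(-d_pos / (2 * tg**2))
         + weights[2] * np.exp(-d_col / (2 * tm**2)))
    np.fill_diagonal(g, 0.0)                 # no self-interaction
    unary = -np.log(np.stack([1.0 - p_fg.ravel(), p_fg.ravel()], 1) + 1e-8)
    q = np.exp(-unary)
    q /= q.sum(1, keepdims=True)
    for _ in range(n_iter):
        msg = g @ q[:, ::-1]                 # Potts: cost from opposite label
        q = np.exp(-unary - msg)
        q /= q.sum(1, keepdims=True)
    return (q[:, 1] >= 0.5).reshape(h, w)
```

An isolated pixel whose unary term disagrees with its similarly colored neighbors gets smoothed to their label, which is exactly the edge-cleanup behavior the parameters above aim for.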
Step 4) correction by click intervention: the user inspects the output of the previous step and clicks any misjudged area in the image to correct the error. The system acquires the click and records its position, negates the foreground/background judgment of the corresponding pixel (foreground becomes background and vice versa), updates the unary potential in the energy function of the conditional random field, and performs posterior inference again. Because a change to a single pixel has little influence on the overall result, this step expands the clicked point into an area by flood fill; similarly, the user can delimit the modified area more finely by choosing scribble input. These areas are treated as having higher confidence and are given more extreme response values in the unary potential function, sufficiently optimizing the segmentation result.
The area further clicked by the user is taken as misjudged pixels, and correction is computed: the algorithm performs region growing from the interaction point, searching by color clustering within a near-rectangular range, updates the unary potentials of that region in the conditional random field, setting the probability of a user-designated foreground region to 0.95 and that of a background region to 0.05, and re-runs the conditional random field inference.
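The click-expansion step can be illustrated with a simple flood fill. This is a sketch under assumed names and a fixed per-channel tolerance; the patent's region growing additionally performs color-clustering search within a near-rectangular range.

```python
from collections import deque

def flood_region(img, seed, tol=30):
    """Grow a region from the user's click by flood fill: BFS over
    4-connected neighbors whose color differs from the seed pixel by at
    most `tol` per channel. The returned pixel set would then receive an
    extreme unary value (0.95 foreground / 0.05 background) in the CRF."""
    h, w = len(img), len(img[0])
    sy, sx = seed
    seed_col = img[sy][sx]
    close = lambda c: all(abs(a - b) <= tol for a, b in zip(c, seed_col))
    seen = {seed}
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and (ny, nx) not in seen \
                    and close(img[ny][nx]):
                seen.add((ny, nx))
                queue.append((ny, nx))
    return seen
```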
Fig. 1 shows the distinction between delimiting-point and box interaction. The random perturbation added during training enhances the robustness of the model, so even a certain deviation in the user's input does not affect the result. The method builds a conditional random field model on the image data and optimizes the semantic segmentation result with the bottom-layer image features. On this basis, the user can click mis-segmented areas to correct them, obtaining an ideal segmentation effect even when the computed model fails.
As shown in FIG. 3, under low-quality user input the conditional random field method automatically corrects semantic segmentation errors, demonstrating the prominent effect of the bottom-layer features on detail optimization.
The specific implementation environment of this embodiment is: an Intel Xeon E5-2660 v3 @ 2.60 GHz CPU and an NVIDIA Tesla K80 graphics card; the performance of the algorithm is tested with the parameters of this embodiment. The GrabCut dataset serves as the test images for a comparative experiment evaluating the segmentation performance and the perturbation resistance of the algorithm. GrabCut, the most representative traditional method, is chosen as the comparison baseline. To ensure fairness, similar interaction means are used: the method takes the four delimiting points as input, GrabCut takes the bounding box derived from those points as input, and no further auxiliary information is used. As shown in fig. 4, the original image, the first correction and the second correction are compared in sequence.
Experiment 1: the IoU metric (intersection over union), commonly used in segmentation tasks, is the evaluation standard; the segmentation of the method is compared with GrabCut, with results shown in FIG. 5. The method outperforms GrabCut and has a clear advantage on some images.
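A minimal reference implementation of the IoU metric used in the experiments; representing masks as sets of pixel coordinates is an assumption for illustration.

```python
def iou(pred, gt):
    """Intersection over union of two binary masks given as iterables of
    hashable pixel coordinates. Two empty masks count as a perfect match."""
    a, b = set(pred), set(gt)
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```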
Experiment 2: considering that interaction errors occur in real use, the performance and stability of the algorithm are tested under different levels of interference. To separate the contributions of the modules to interference resistance, the results of semantic segmentation alone and of boundary optimization are counted separately, giving three groups: semantic segmentation, boundary optimization, and the GrabCut baseline. Three perturbation levels are used: a) no perturbation, reusing the results of experiment 1; b) small perturbation, with a maximum range of 10 pixels; c) large perturbation, with a maximum range of 20 pixels. Random position offsets within the range are generated independently for each of the four delimiting points, and each interference group is run 10 times to average the IoU; the results are shown in figs. 6-9. The analysis yields the following conclusions: 1) compared with GrabCut, the semantic segmentation realized by the convolutional neural network and the conditional random field boundary optimization are more robust; 2) in specific cases a small perturbation can even improve the segmentation effect, because interference factors were added during training; 3) under large perturbation, boundary optimization with the bottom-layer features provides stronger robustness and effect stability.
Compared with the prior art, the method significantly improves the image segmentation effect, i.e. the IoU metric, and the robustness of the system under different degrees of interference.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (7)

1. A method for accurate semantic segmentation and optimization of images using interaction, characterized in that: the user marks delimiting points as interactive input; a deep convolutional network extracts semantic features to realize a preliminary segmentation of the image to be segmented; a conditional random field model of the image to be segmented is then constructed, and the bottom-layer features of the image optimize the result of the preliminary segmentation; finally, the accurate segmentation result is further optimized through interaction;
the conditional random field model is a graph model established by using a fully connected conditional random field, and the position and color correlation information between pixel points is used as the bottom layer characteristics for optimizing the primary segmentation.
2. The method for accurate semantic segmentation and optimization of images according to claim 1, wherein the user marks the delimiting points as follows: the pixels of the target object closest to its top, bottom, left and right boundaries are defined as the object's delimiting points, and the user marks these four points by clicking.
3. The method for accurate semantic segmentation and optimization of images according to claim 1, wherein the preliminary segmentation, i.e. image segmentation at the semantic level, is realized by a convolutional neural network with a ResNet101+DeepLabV2 structure.
4. The method for accurate semantic segmentation and optimization of images according to claim 1, wherein the training data of the convolutional neural network comprises the finely annotated PASCAL dataset and the coarsely annotated COCO dataset; four delimiting points are taken for each annotated object, with random perturbation within 10 pixels added to enhance the robustness of the model.
5. The method for accurate semantic segmentation and optimization of images according to claim 1, wherein the conditional random field model of the image to be segmented is constructed as follows: the image is treated as a graph model in which each vertex represents a pixel and the edges between vertices represent the dependency between the corresponding pixels, and the Gibbs energy of the fully connected conditional random field is adopted:

E(x) = \sum_i \Psi_i(x_i) + \sum_{i<j} \Psi_{ij}(x_i, x_j)

wherein: i, j are pixel indices ranging over the set of all pixels; x_i is the class of pixel i; \Psi_i is the unary potential of pixel i, the initial class probability distribution generated from the preliminary segmentation result and converted with a negative logarithm, \Psi_i(x_i) = -\log P(x_i), where P(x_i) is the probability that pixel i is assigned class x_i; \Psi_{ij} is the binary potential, \Psi_{ij} = g(i,j)\,[x_i \neq x_j], g(i,j) = w_1 g_1(i,j) + w_2 g_2(i,j) + w_3 g_3(i,j),

g_1(i,j) = \exp\!\left(-\frac{|I_i - I_j|^2}{2\theta_\alpha^2} - \frac{|p_i - p_j|^2}{2\theta_\beta^2}\right),
g_2(i,j) = \exp\!\left(-\frac{|I_i - I_j|^2}{2\theta_\gamma^2}\right),
g_3(i,j) = \exp\!\left(-\frac{|p_i - p_j|^2}{2\theta_\mu^2}\right)

wherein: p_i is the color of pixel i and I_i its coordinates; g(i,j) is the weighted correlation of pixels i and j; [\cdot] is the label penalty function, equal to 1 when the condition is true and 0 otherwise; g_1 is the color correlation of neighboring pixels, g_2 the position correlation of global pixels, and g_3 the color correlation of global pixels; \theta_{\alpha,\beta,\gamma,\mu} are the correlation parameters, and the parameters w_{1,2,3} adjust the weight proportions of the different correlations.
6. The method for accurate semantic segmentation and optimization of images according to claim 1 or 5, characterized in that conditional random field modeling is performed on the cropped sub-image, and the unary potentials of the pixels are computed from the probability map obtained by the convolutional neural network;
in the binary potential computation, several parameters balance the proportions of the different pixel correlations in the global relation, set as w_1 = 6, w_2 = 7, w_3 = 1, θ_α = 15, θ_β = 3, θ_μ = 7, θ_γ = 1.
7. A system for implementing the method of any preceding claim, comprising a user interaction unit, a preprocessing unit, a semantic segmentation unit and a boundary optimization unit, wherein: the preprocessing unit is connected to the semantic segmentation unit and transmits the processed image and interaction information; the semantic segmentation unit is connected to the boundary optimization unit and transmits the semantic segmentation of the image; the boundary optimization unit receives both the semantic segmentation and the user interaction information, computes the optimized result and feeds it back; the conditional random field model is used to model the bottom-layer features to assist in optimizing the semantic segmentation, enhancing the robustness of the result; the user can notify the system by clicking a mis-segmented region, the system updates the conditional random field model and re-infers the segmentation, and the user obtains immediate effect feedback.
CN202010149919.2A 2020-03-06 2020-03-06 Image semantic accurate segmentation and optimization method using interaction means Pending CN113361530A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010149919.2A CN113361530A (en) 2020-03-06 2020-03-06 Image semantic accurate segmentation and optimization method using interaction means


Publications (1)

Publication Number Publication Date
CN113361530A 2021-09-07

Family

ID=77523994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010149919.2A Pending CN113361530A (en) 2020-03-06 2020-03-06 Image semantic accurate segmentation and optimization method using interaction means

Country Status (1)

Country Link
CN (1) CN113361530A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578107A (en) * 2013-11-07 2014-02-12 中科创达软件股份有限公司 Method for interactive image segmentation
CN104463843A (en) * 2014-10-31 2015-03-25 南京邮电大学 Interactive image segmentation method of android system
CN105809673A (en) * 2016-03-03 2016-07-27 上海大学 SURF (Speeded-Up Robust Features) algorithm and maximal similarity region merging based video foreground segmentation method
CN107564025A (en) * 2017-08-09 2018-01-09 浙江大学 A kind of power equipment infrared image semantic segmentation method based on deep neural network
CN108062756A (en) * 2018-01-29 2018-05-22 重庆理工大学 Image, semantic dividing method based on the full convolutional network of depth and condition random field

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
K.-K. Maninis et al.: "Deep Extreme Cut: From Extreme Points to Object Segmentation", arXiv *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116385459A (en) * 2023-03-08 2023-07-04 阿里巴巴(中国)有限公司 Image segmentation method and device
CN116385459B (en) * 2023-03-08 2024-01-09 阿里巴巴(中国)有限公司 Image segmentation method and device

Similar Documents

Publication Publication Date Title
CN110163198B (en) Table identification reconstruction method and device and storage medium
Endres et al. Category-independent object proposals with diverse ranking
CN112734775B (en) Image labeling, image semantic segmentation and model training methods and devices
CN113449594B (en) Multilayer network combined remote sensing image ground semantic segmentation and area calculation method
dos Santos et al. A relevance feedback method based on genetic programming for classification of remote sensing images
CN110210387B (en) Method, system and device for detecting insulator target based on knowledge graph
CN111652317B (en) Super-parameter image segmentation method based on Bayes deep learning
US20060221090A1 (en) Image processing apparatus, method, and program
CN113240691A (en) Medical image segmentation method based on U-shaped network
CN114694038A (en) High-resolution remote sensing image classification method and system based on deep learning
CN110827312A (en) Learning method based on cooperative visual attention neural network
CN113111716A (en) Remote sensing image semi-automatic labeling method and device based on deep learning
CN116645592A (en) Crack detection method based on image processing and storage medium
CN114241326A (en) Progressive intelligent production method and system for ground feature elements of remote sensing images
Wang et al. Unsupervised segmentation of greenhouse plant images based on modified Latent Dirichlet Allocation
CN115147632A (en) Image category automatic labeling method and device based on density peak value clustering algorithm
CN117115614B (en) Object identification method, device, equipment and storage medium for outdoor image
CN113705579A (en) Automatic image annotation method driven by visual saliency
CN113361530A (en) Image semantic accurate segmentation and optimization method using interaction means
CN116935418A (en) Automatic three-dimensional graphic template reorganization method, device and system
CN113780040A (en) Lip key point positioning method and device, storage medium and electronic equipment
Jiang et al. An optimized higher order CRF for automated labeling and segmentation of video objects
CN115205624A (en) Cross-dimension attention-convergence cloud and snow identification method and equipment and storage medium
CN114445649A (en) Method for detecting RGB-D single image shadow by multi-scale super-pixel fusion
Fateh et al. Color reduction in hand-drawn Persian carpet cartoons before discretization using image segmentation and finding edgy regions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210907