CN111768415A - Image instance segmentation method without quantization pooling - Google Patents

Image instance segmentation method without quantization pooling

Info

Publication number
CN111768415A
CN111768415A (application CN202010542619.0A)
Authority
CN
China
Prior art keywords: pooling, feature, quantization, image, instance segmentation
Prior art date: 2020-06-15
Legal status
Pending
Application number
CN202010542619.0A
Other languages
Chinese (zh)
Inventor
苏丽 (Su Li)
孙雨鑫 (Sun Yuxin)
苑守正 (Yuan Shouzheng)
Current Assignee
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date: 2020-06-15
Filing date: 2020-06-15
Publication date: 2020-10-13
Application filed by Harbin Engineering University
Priority to CN202010542619.0A
Publication of CN111768415A
Legal status: Pending


Classifications

    • G06T 7/11 Region-based segmentation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/253 Fusion techniques of extracted features
    • G06N 3/045 Combinations of networks
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06T 7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06T 2207/10024 Color image
    • G06T 2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Abstract

The invention provides an image instance segmentation method with quantization-free pooling, comprising the following steps. S1: input a two-dimensional image of any size into a deep feature extraction network to obtain multi-layer feature maps, and extract candidate regions through a region proposal network. S2: pool the feature maps of candidate regions of different sizes to the same size using a quantization-free pooling layer. S3: feed two detection branches to predict the category and position of each candidate region, while a parallel mask branch performs foreground/background mask segmentation on each candidate region and restores it to the original image size. The invention solves the prior-art problem of losing pixel spatial information when pooling candidate-region feature maps of different sizes: the quantization-free pooling layer puts feature-map pixels into one-to-one correspondence with original-image pixels without introducing any parameters, guaranteeing the accuracy of target positions and thereby improving image instance segmentation accuracy.

Description

Image instance segmentation method without quantization pooling
Technical Field
The invention relates to image instance segmentation methods, and in particular to an image instance segmentation method with quantization-free pooling; it belongs to the field of two-dimensional image processing in computer vision.
Background
In recent years, the rapid development of deep learning has pushed image processing in computer vision into a new technical era. Image instance segmentation is a fundamental research problem, and the technology can be widely applied in fields such as autonomous driving, robot control, and video surveillance.
Image instance segmentation is regarded as a combination of object detection and semantic segmentation: it must correctly separate all instances in an image while assigning pixel-level class labels to each instance. For example, in a picture containing several dogs, not only must all dogs be found, but each individual dog must also be distinguished and labeled, e.g., dog 1 and dog 2.
Image instance segmentation methods fall mainly into two-stage and one-stage approaches. Mask R-CNN is the classic two-stage method and serves as the baseline for many derived applications; it inherits the idea of the two-stage object detector Faster R-CNN and adds a fully convolutional network to predict target mask information. Instance segmentation is challenging because pixel spatial position information must be well exploited to distinguish different instances of the same class in an image. The positions of the candidate regions generated by the region proposal network in Mask R-CNN are obtained by model regression and are usually floating-point numbers. During pooling, bilinear interpolation is used to obtain pixel values at floating-point positions, and the value at a floating-point coordinate is computed from a fixed set of 4 sampling points, namely the four surrounding integer pixels. This fixed choice does not adapt to different images and does not fully account for every pixel in the feature map of a candidate region; it indirectly introduces a quantization-rounding problem, loses pixel spatial information, and produces pixel offsets that make target positions inaccurate. A target mask then cannot be segmented accurately at that position, which degrades subsequent instance segmentation accuracy.
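To make this limitation concrete, the following minimal NumPy sketch (an illustration, not code from the patent) shows how one RoIAlign-style bilinear sample at a floating-point coordinate draws on exactly four fixed integer neighbours, regardless of image content:

```python
import numpy as np

def bilinear_sample(feature, x, y):
    """Sample a 2-D feature map at floating-point (x, y) the way RoIAlign
    does: from the four fixed integer neighbours only."""
    h, w = feature.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))      # top-left neighbour
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)  # bottom-right neighbour
    x0, y0 = max(x0, 0), max(y0, 0)                  # clamp to the border
    dx, dy = x - np.floor(x), y - np.floor(y)        # fractional weights
    return (feature[y0, x0] * (1 - dx) * (1 - dy)
            + feature[y0, x1] * dx * (1 - dy)
            + feature[y1, x0] * (1 - dx) * dy
            + feature[y1, x1] * dx * dy)
```

Because the four sampling points and their weights are determined by the coordinate alone, all other pixels in the bin contribute nothing, which is precisely the information loss this patent targets.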
A search of existing methods found no similar patents. Therefore, to address the loss of pixel spatial information during pooling in the prior art, an image instance segmentation method with quantization-free pooling is proposed, which is of real value for improving segmentation accuracy.
Disclosure of Invention
The invention aims to provide an image instance segmentation method with quantization-free pooling that retains complete pixel spatial information when pooling candidate-region feature maps of different sizes, guarantees the accuracy of target positions, and thereby improves instance segmentation accuracy, especially for small targets in images.
The purpose of the invention is realized as follows:
An image instance segmentation method with quantization-free pooling comprises the following steps:
S1: input a two-dimensional image of any size into a deep feature extraction network to obtain multi-layer feature maps, and extract candidate regions through a region proposal network; construct the deep feature extraction network by fusing different feature layers with a residual convolutional network and a feature pyramid, making full use of the image's shallow position information and deep semantic information; the region proposal network extracts candidate regions and distinguishes foreground from background;
S2: pool the feature maps of candidate regions of different sizes to the same size using a quantization-free pooling layer;
S3: feed two detection branches to predict the category and position of each candidate region, while a parallel mask branch performs foreground/background mask segmentation on each candidate region and restores it to the original image size.
The invention also includes the following features:
the deep feature extraction network is built by fusing the feature maps activated at the 4 stages of a 101-layer residual convolutional network with a 4-layer feature pyramid; inputting a two-dimensional image of any size into the deep feature extraction network yields 4 feature maps with the same number of channels and different sizes;
the region proposal network is a convolutional layer with a 3 × 3 kernel and 256 channels;
the foreground is a position containing a target, and the background is a position without a target;
step S2 interpolates the discrete candidate-region feature map into a continuous space using the quantization-free pooling layer, and directly computes a double integral over the continuous feature map, divided by the area, to obtain a mean value;
the two detection branches are multi-layer fully connected classification and regression networks;
the mask branch is a fully convolutional network comprising 4 convolutional layers and 1 deconvolutional layer.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention solves the loss of pixel spatial information when pooling candidate-region feature maps of different sizes, better preserving the positional accuracy of targets. The quantization-free pooling layer puts feature-map pixels into one-to-one correspondence with original-image pixels without introducing any parameters, guaranteeing the accuracy of target positions and thereby improving image instance segmentation accuracy.
2. The method comprehensively improves the accuracy of two-dimensional image instance segmentation in common scenes, with particular gains on small targets.
Drawings
FIG. 1 is a flow chart of the quantization-free pooling image instance segmentation method of the present invention;
FIG. 2 is a schematic diagram of the deep feature extraction network of the quantization-free pooling image instance segmentation method of the present invention;
FIG. 3 is a schematic diagram of RoIAlign in the quantization-free pooling image instance segmentation method of the present invention;
FIG. 4 is a schematic diagram of quantization-free pooling in the image instance segmentation method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The invention comprises the following steps:
S1: input a two-dimensional image of any size into the deep feature extraction network to obtain multi-layer feature maps, and extract candidate regions through the region proposal network.
Construct the deep feature extraction network, fusing different feature layers with a residual convolutional network and a feature pyramid to make full use of the image's shallow position information and deep semantic information. The region proposal network extracts candidate regions; its classification and regression branches perform binary foreground/background classification (without distinguishing specific categories) and fine adjustment of the candidate-region positions.
S2: pool the candidate-region feature maps of different sizes to the same size using the quantization-free pooling layer. Compared with the original Mask R-CNN pooling method, the quantization-free pooling layer directly computes a double integral over the continuous feature map and then takes the mean, fully preserving pixel spatial information. It requires no predefined number of sampling points, avoids any quantization of coordinates, and has a continuous gradient with respect to the candidate-region coordinates.
Map the candidate region from the original image onto the feature map according to its coordinate position, and map the discrete feature map into a continuous space by interpolation. The mean of the double integral over the pixels of the continuous feature map is taken as the pooled pixel value:
$$\mathrm{Pool}(g,\tilde{f}) = \frac{\int_{y_1}^{y_2}\int_{x_1}^{x_2} \tilde{f}(x,y)\,dx\,dy}{(x_2-x_1)(y_2-y_1)}$$

where g denotes a pooling bin with corners $(x_1, y_1)$ and $(x_2, y_2)$, and $\tilde{f}$ is the interpolated continuous feature map defined in the detailed description below.
S3: feed two detection branches to predict the category and position of each candidate region, while the parallel mask branch, a fully convolutional network, performs foreground/background mask segmentation on each candidate region.
The method uses 135,000 common-scene images and their ground-truth annotation files from the open-source COCO dataset as the training set, 5,000 images with ground-truth annotations as the validation set, and the COCO segmentation metrics to evaluate instance segmentation accuracy.
Referring to FIG. 1, an embodiment of the quantization-free pooling image instance segmentation method of this patent is described below; it specifically comprises the following steps.
S1: input a two-dimensional image of any size into the deep feature extraction network to obtain multi-layer feature maps, and extract candidate regions through the region proposal network.
S11: the two-dimensional image may be a three-channel RGB color image.
S12: construction of deep layerAnd (4) characterizing and extracting a network, referring to the second step, fusing different feature layers of high and low layers by using a 101-layer residual convolutional network and a 4-layer feature pyramid, and fully utilizing shallow position information and deep semantic information of the image. Residual convolutional network 4-stage activation feature map is { C2,C3,C4,C5(wherein C)1Large-layer parameter amount occupied memory is not selected) is fused into the feature pyramid, and then the feature graph is { P }2,P3,P4,P5And the number of channels of all the characteristic maps is 256. In particular as C5Dimension reduction to P by 1 x 1 convolution operation5To P52 times up-sampling and C after 1 x 1 convolution operation4Performing addition operation to generate P4Other feature layers operate the same.
S13: refine the features through the region proposal network to generate candidate regions. The region proposal network is a common preprocessing method for generating candidate regions in object detection tasks; it usually follows the feature extraction network as a convolutional layer with a 3 × 3 kernel and 256 channels, with two parallel 1 × 1 convolutions forming classification and regression branches that perform binary foreground/background classification of the candidate regions (without distinguishing specific categories) and a first fine adjustment of candidate-region positions.
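A minimal PyTorch sketch of this proposal head follows; the anchor count per location is an assumed hyperparameter, and the 2-way scores plus 4 box offsets per anchor follow the standard Faster R-CNN convention rather than anything stated in the patent:

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """3x3 conv with 256 channels, followed by two parallel 1x1 convs:
    one scores foreground vs. background, one regresses box offsets."""
    def __init__(self, in_channels=256, num_anchors=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(256, num_anchors * 2, kernel_size=1)  # fg/bg per anchor
        self.reg = nn.Conv2d(256, num_anchors * 4, kernel_size=1)  # (dx, dy, dw, dh)

    def forward(self, feature):
        x = self.conv(feature).relu()
        return self.cls(x), self.reg(x)
```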
S2: referring to FIG. 3, Mask R-CNN pools the feature maps corresponding to candidate regions to the same scale before feeding them into the subsequent fully connected classification and regression networks, using the RoIAlign algorithm. Briefly:
S21: bilinear interpolation computes the image pixel value at a floating-point coordinate from neighbouring pixel points, supplementing the missing coordinate points on the candidate-region boundary so that pooling of the candidate region becomes a continuous operation; see FIG. 3.
However, bilinear interpolation samples a fixed set of four surrounding pixel points to obtain the value at a single point. This fixed choice has no adaptability, fails to fully consider every pixel in the feature map corresponding to the candidate region, and its coordinate quantization problem loses the target's spatial position information, seriously affecting subsequent target localization and segmentation.
S22: the quantization-free pooling layer first computes the double integral over the corresponding feature region and then takes the mean, avoiding any quantization of coordinates and providing a continuous gradient with respect to the bounding-box coordinates, so that feature-map pixels correspond one-to-one with original-image pixels and the accuracy of the target position is guaranteed. Referring to FIG. 4, the algorithm is as follows.
First, map the candidate region from the original image onto the feature map according to its coordinate position and crop the feature map over that region; this gives the discrete feature map. To retain more detailed information, the candidate-region feature map f is interpolated, mapping the discrete feature map into a continuous space to obtain the continuous feature map $\tilde{f}$. The formulas are:

$$\tilde{f}(x,y) = \sum_{i,j} C(x,y,i,j)\, f(i,j)$$

$$C(x,y,i,j) = \max(0,\, 1-|x-i|) \times \max(0,\, 1-|y-j|)$$

where (i, j) is a position coordinate on the discrete feature map, f(i, j) is the pixel value at that coordinate, (x, y) is a position coordinate on the continuous feature map, $\tilde{f}(x,y)$ is the interpolated pixel value, and C(x, y, i, j) is the interpolation coefficient.
Assume the feature map corresponding to the candidate region has size (W, H) and the quantization-free pooled output size is S × S; one pixel value must then be produced for each bin g of size (W/S, H/S), forming an S × S pixel map. Let $(x_1, y_1, x_2, y_2)$ be the coordinates of one bin of the candidate region, where $(x_1, y_1)$ is the bin's top-left corner and $(x_2, y_2)$ its bottom-right corner, and let (x, y) be the continuous coordinates after interpolation. To make full use of all pixel information on the feature map, the quantization-free pooling layer computes the double integral of the pixel values over each bin g and divides by the bin's area; the resulting mean represents the pooled pixel value of that bin, and all bins together form the pooled feature map. The formula is:

$$\mathrm{Pool}(g,\tilde{f}) = \frac{\int_{y_1}^{y_2}\int_{x_1}^{x_2} \tilde{f}(x,y)\,dx\,dy}{(x_2-x_1)(y_2-y_1)}$$
The analytic pooling formula $\mathrm{Pool}(g,\tilde{f})$ is continuously differentiable, so the candidate regions can participate in the back-propagation of the neural network. For example, the partial derivative with respect to $x_1$ is computed as follows:

$$\frac{\partial\,\mathrm{Pool}(g,\tilde{f})}{\partial x_1} = \frac{\mathrm{Pool}(g,\tilde{f})}{x_2-x_1} - \frac{\int_{y_1}^{y_2} \tilde{f}(x_1,y)\,dy}{(x_2-x_1)(y_2-y_1)}$$
for continuous functions, other coordinate partial derivatives can be calculated in the same way.
With this improvement, the candidate-region coordinates become continuous without introducing any parameters; all surrounding pixels are considered, so pixel spatial information is better preserved, feature-map pixels correspond one-to-one with original-image pixels, and the accuracy of the target position is guaranteed, especially for small targets in an image (a deviation of 0.5 pixels is negligible for a large target, but the impact of that error is much greater for a small target).
S3: after quantization-free pooling, the candidate-region feature maps, now of equal size, are fed into the two detection branches, i.e., multi-layer fully connected classification and regression networks that predict the category and position of each candidate region. The mask branch shares the feature layers with the detection branches: after the quantization-free pooling layer, a four-layer fully convolutional network performs foreground/background mask segmentation on each candidate region that may contain a target. The final outputs are the category, position coordinates, and segmentation mask of every instance in the image.
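A minimal PyTorch sketch of the mask branch just described (four 3 × 3 convolutions followed by one 2× deconvolution) is shown below; the 256-channel width, ReLU activations, and the final per-class 1 × 1 predictor are standard Mask R-CNN choices assumed here rather than stated in the patent:

```python
import torch.nn as nn

class MaskHead(nn.Module):
    """Four 3x3 convolutions followed by one 2x-upsampling deconvolution,
    ending in a 1x1 conv that emits one foreground/background mask per class."""
    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        convs = []
        for _ in range(4):
            convs += [nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True)]
            in_channels = 256
        self.convs = nn.Sequential(*convs)
        self.deconv = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
        self.predict = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, roi_feature):          # e.g. (N, 256, 14, 14)
        x = self.convs(roi_feature)
        x = self.deconv(x).relu()            # -> (N, 256, 28, 28)
        return self.predict(x)               # per-class binary mask logits
```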
Comparison experiments on the open-source COCO dataset show that the method effectively improves image instance segmentation accuracy. The training set contains 135,000 common-scene images and the validation set 5,000 images.
Accuracy is evaluated with the standard COCO metrics: AP (averaged over all IoU thresholds), AP50, AP75, APS, APM, and APL. The results of the experiments on the validation set are as follows.
(The results table is reproduced only as an image in the original publication.)
Under the same settings, the instance segmentation accuracy of the invention is higher than that of Mask R-CNN, especially for small targets in images.
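Evaluating segmentation results with these COCO metrics is typically a few lines with the pycocotools package, as sketched below; the annotation and result file names are placeholders:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")    # ground-truth annotations
coco_dt = coco_gt.loadRes("segm_results.json")          # model predictions
evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")  # instance-mask metrics
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP, AP50, AP75, APS, APM, APL
```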
In summary: this patent discloses an image instance segmentation method with quantization-free pooling, aimed mainly at the target instance segmentation problem for two-dimensional images of common scenes. The method comprises the following steps: (1) input a two-dimensional image of any size into a deep feature extraction network to obtain multi-layer feature maps, and extract candidate regions through a region proposal network; (2) pool the candidate-region feature maps using a quantization-free pooling layer; (3) feed two detection branches to predict the category and position of each candidate region, while a parallel mask branch performs foreground/background mask segmentation on each candidate region. The invention solves the prior-art problem of losing pixel spatial information when pooling candidate-region feature maps of different sizes: the quantization-free pooling layer puts feature-map pixels into one-to-one correspondence with original-image pixels without introducing any parameters, guaranteeing target-position accuracy and thereby improving image instance segmentation accuracy.

Claims (7)

1. An image instance segmentation method with quantization-free pooling, characterized by comprising the following steps:
S1: inputting a two-dimensional image of any size into a deep feature extraction network to obtain multi-layer feature maps, and extracting candidate regions through a region proposal network; constructing the deep feature extraction network by fusing different feature layers with a residual convolutional network and a feature pyramid, making full use of the image's shallow position information and deep semantic information; the region proposal network extracting candidate regions and distinguishing foreground from background;
S2: pooling the feature maps of candidate regions of different sizes to the same size using a quantization-free pooling layer;
S3: feeding two detection branches to predict the category and position of each candidate region, while a parallel mask branch performs foreground/background mask segmentation on each candidate region and restores it to the original image size.
2. The quantization-free pooling image instance segmentation method of claim 1, wherein the deep feature extraction network is constructed by fusing the feature maps activated at the 4 stages of a 101-layer residual convolutional network with a 4-layer feature pyramid, and inputting a two-dimensional image of any size into the deep feature extraction network yields 4 feature maps with the same number of channels and different sizes.
3. The quantization-free pooling image instance segmentation method of claim 1, wherein the region proposal network is a convolutional layer with a 3 × 3 kernel and 256 channels.
4. The quantization-free pooling image instance segmentation method of claim 1, wherein the foreground is a position containing a target and the background is a position without a target.
5. The quantization-free pooling image instance segmentation method of claim 1, wherein step S2 interpolates the discrete candidate-region feature map into a continuous space using the quantization-free pooling layer, and directly computes a double integral over the continuous feature map, divided by the area, to obtain a mean value.
6. The quantization-free pooling image instance segmentation method of claim 1, wherein the two detection branches are multi-layer fully connected classification and regression networks.
7. The quantization-free pooling image instance segmentation method of claim 1, wherein the mask branch is a fully convolutional network comprising 4 convolutional layers and 1 deconvolutional layer.
CN202010542619.0A 2020-06-15 2020-06-15 Image instance segmentation method without quantization pooling Pending CN111768415A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010542619.0A CN111768415A (en) 2020-06-15 2020-06-15 Image instance segmentation method without quantization pooling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010542619.0A CN111768415A (en) 2020-06-15 2020-06-15 Image instance segmentation method without quantization pooling

Publications (1)

Publication Number Publication Date
CN111768415A (en) 2020-10-13

Family

ID=72721737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010542619.0A Pending CN111768415A (en) 2020-06-15 2020-06-15 Image instance segmentation method without quantization pooling

Country Status (1)

Country Link
CN (1) CN111768415A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381835A (en) * 2020-10-29 2021-02-19 中国农业大学 Crop leaf segmentation method and device based on convolutional neural network
CN112613483A (en) * 2021-01-05 2021-04-06 中国科学技术大学 Outdoor fire early warning method based on semantic segmentation and recognition
CN113011332A (en) * 2021-03-19 2021-06-22 中国科学技术大学 Face counterfeiting detection method based on multi-region attention mechanism
CN113269792A (en) * 2021-05-07 2021-08-17 上海交通大学 Image post-harmony processing method, system and terminal
CN113421269A (en) * 2021-06-09 2021-09-21 南京瑞易智能科技有限公司 Real-time semantic segmentation method based on double-branch deep convolutional neural network
CN114066788A (en) * 2021-10-26 2022-02-18 华南理工大学 Balanced instance segmentation data synthesis method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108734123A (en) * 2018-05-18 2018-11-02 武昌理工学院 Highway signs recognition methods, electronic equipment, storage medium and system
CN110059589A (en) * 2019-03-21 2019-07-26 昆山杜克大学 The dividing method of iris region in a kind of iris image based on Mask R-CNN neural network
CN110276765A (en) * 2019-06-21 2019-09-24 北京交通大学 Image panorama dividing method based on multi-task learning deep neural network
CN110659664A (en) * 2019-08-02 2020-01-07 杭州电子科技大学 SSD-based method for high-precision identification of small objects
KR102097120B1 (en) * 2018-12-31 2020-04-09 AgileSoDA Co., Ltd. System and method for automatically determining the degree of breakdown by vehicle section based on deep learning
US20200356718A1 (en) * 2019-05-10 2020-11-12 Sandisk Technologies Llc Implementation of deep neural networks for testing and quality control in the production of memory devices

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108734123A (en) * 2018-05-18 2018-11-02 武昌理工学院 Highway signs recognition methods, electronic equipment, storage medium and system
KR102097120B1 (en) * 2018-12-31 2020-04-09 AgileSoDA Co., Ltd. System and method for automatically determining the degree of breakdown by vehicle section based on deep learning
CN110059589A (en) * 2019-03-21 2019-07-26 昆山杜克大学 The dividing method of iris region in a kind of iris image based on Mask R-CNN neural network
US20200356718A1 (en) * 2019-05-10 2020-11-12 Sandisk Technologies Llc Implementation of deep neural networks for testing and quality control in the production of memory devices
CN110276765A (en) * 2019-06-21 2019-09-24 北京交通大学 Image panorama dividing method based on multi-task learning deep neural network
CN110659664A (en) * 2019-08-02 2020-01-07 杭州电子科技大学 SSD-based method for high-precision identification of small objects

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHU LIU: "Path Aggregation Network for Instance Segmentation", 《IEEE》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381835A (en) * 2020-10-29 2021-02-19 中国农业大学 Crop leaf segmentation method and device based on convolutional neural network
CN112613483A (en) * 2021-01-05 2021-04-06 中国科学技术大学 Outdoor fire early warning method based on semantic segmentation and recognition
CN113011332A (en) * 2021-03-19 2021-06-22 中国科学技术大学 Face counterfeiting detection method based on multi-region attention mechanism
CN113269792A (en) * 2021-05-07 2021-08-17 上海交通大学 Image post-harmony processing method, system and terminal
CN113269792B (en) * 2021-05-07 2023-07-21 上海交通大学 Image later-stage harmony processing method, system and terminal
CN113421269A (en) * 2021-06-09 2021-09-21 南京瑞易智能科技有限公司 Real-time semantic segmentation method based on double-branch deep convolutional neural network
CN114066788A (en) * 2021-10-26 2022-02-18 华南理工大学 Balanced instance segmentation data synthesis method
CN114066788B (en) * 2021-10-26 2024-03-29 华南理工大学 Balanced instance segmentation data synthesis method

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN109859190B (en) Target area detection method based on deep learning
CN109344701B (en) Kinect-based dynamic gesture recognition method
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN111768415A (en) Image instance segmentation method without quantization pooling
CN109583483B (en) Target detection method and system based on convolutional neural network
CN114202672A (en) Small target detection method based on attention mechanism
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN110659664B (en) SSD-based high-precision small object identification method
CN110176024B (en) Method, device, equipment and storage medium for detecting target in video
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
CN110969171A (en) Image classification model, method and application based on improved convolutional neural network
CN112651423A (en) Intelligent vision system
CN111401380A (en) RGB-D image semantic segmentation method based on depth feature enhancement and edge optimization
CN113159043A (en) Feature point matching method and system based on semantic information
CN108345835B (en) Target identification method based on compound eye imitation perception
CN114140623A (en) Image feature point extraction method and system
CN112861970A (en) Fine-grained image classification method based on feature fusion
CN114519819B (en) Remote sensing image target detection method based on global context awareness
CN114299101A (en) Method, apparatus, device, medium, and program product for acquiring target region of image
CN111368637B (en) Transfer robot target identification method based on multi-mask convolutional neural network
CN109583584B (en) Method and system for enabling CNN with full connection layer to accept indefinite shape input
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN116091823A (en) Single-feature anchor-frame-free target detection method based on fast grouping residual error module
CN112699898B (en) Image direction identification method based on multi-layer feature fusion

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201013