CN113077484A - Image instance segmentation method

Image instance segmentation method

Info

Publication number
CN113077484A
CN113077484A
Authority
CN
China
Prior art keywords
contour
deformation
point
points
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110342899.5A
Other languages
Chinese (zh)
Other versions
CN113077484B (en)
Inventor
张永生
吕可枫
于英
汪汉云
宋亮
李力
李磊
闵杰
张磊
王自全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force
Priority to CN202110342899.5A
Publication of CN113077484A
Application granted
Publication of CN113077484B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/12 Edge-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10004 Still image; Photographic image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20004 Adaptive image processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2210/00 Indexing scheme for image generation or computer graphics
    • G06T2210/44 Morphing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of image instance segmentation, and specifically relates to an image instance segmentation method. First, a target image is input into a constructed initial contour extraction model to obtain the initial contour of a target in the image. A number of contour points are then sampled on the initial contour, and several hypothetical deformation points are extracted around each contour point. Next, a constructed feature extraction model is used to obtain the features of all contour points and of the hypothetical deformation points around each contour point. Finally, these features are input into a constructed deformed contour extraction model to obtain the deformation result of each contour point and, in turn, the deformed contour of the target in the image. The method yields finer segmentation results with higher edge contour accuracy, making it suitable for tasks with stringent precision requirements.

Description

Image instance segmentation method
Technical Field
The invention belongs to the technical field of image instance segmentation, and particularly relates to an image instance segmentation method.
Background
Segmentation tasks underpin many image processing tasks, such as change detection, dynamic monitoring and classification, and the accuracy and efficiency of segmentation affect all subsequent work. Instance segmentation integrates three tasks, classification, localization and segmentation, making it one of the most difficult visual tasks among comparable image processing tasks. From a procedural point of view, instance segmentation methods fall into two categories: detection-based methods and contour-based methods.
Detection-based methods first detect an object and then classify it pixel by pixel to obtain the final instance segmentation result. However, such pixel-based methods do not exploit edge information, which leads to poor segmentation quality at object edges.
The other approach achieves segmentation by recovering the contour of the object. Contour-based methods must solve the problem of measuring the distance between the approximate contour formed by a point set and the actual contour, and of designing a reasonable loss function for the learning network accordingly; the initial placement of the point set and the feature extraction for its points must also be addressed. Quantitatively, a contour-based method requires fewer decisions than region-by-region pixel classification, and a smaller decision count means a smaller error rate, so contour-based instance segmentation is well worth studying. Compared with pixel-by-pixel classification, the contour approach offers higher efficiency and speed, but places higher demands on the training network and the initial contour.
For some specific image processing tasks (e.g., change detection with remote sensing images), higher image resolution and larger image extent both impose stricter precision requirements on the segmentation. Existing instance segmentation methods achieve good overall results on public datasets, but their accuracy on object contours is insufficient; they are therefore unsuitable for tasks with high fineness requirements and inadequate as fine segmentation results to guide subsequent work such as change detection and instance localization.
Disclosure of Invention
The invention provides an image instance segmentation method to solve the problem that contours extracted with the prior art have low precision.
To solve this technical problem, the technical scheme and its corresponding beneficial effects are as follows:
the invention provides an image instance segmentation method, which comprises the following steps:
1) inputting a target image into the constructed initial contour extraction model to obtain an initial contour of a target in the image, and sampling a number of contour points on the initial contour;
2) extracting a number of hypothetical deformation points around each contour point;
3) obtaining the features of each contour point and the features of the hypothetical deformation points around each contour point by using the constructed feature extraction model;
4) inputting the obtained features of the contour points and of the hypothetical deformation points around each contour point into the constructed deformed contour extraction model to obtain the deformation result of each contour point, and hence the deformed contour of the target in the image, thereby refining the segmentation edge of the initial contour.
The beneficial effects of the above technical scheme are as follows: after the initial contour of the target is obtained, several hypothetical deformation points are extracted around each contour point; the features of each contour point and of its surrounding hypothetical deformation point network are determined and input into the constructed deformed contour extraction model, yielding the deformation result of each contour point and hence a refined contour.
Further, in order to adaptively choose the scale of the hypothetical deformation point network and thus better fit the initial contour, each contour point together with the hypothetical deformation points extracted around it forms a hypothetical deformation point network, whose size is:
S=[S0+log2(wh/WH)]
where [ ] denotes rounding; S is the size of the hypothetical deformation point network; S0 is the base size for an object whose area equals W × H; w and h are the width and height of the detection box; and W and H are the width and height of the target image.
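As a minimal illustration, the adaptive size rule can be computed directly from the formula; the following Python sketch uses hypothetical names, with S0 = 20 pixels taken from the embodiment described later and the example box size invented for demonstration:

import math

def point_net_size(S0, w, h, W, H):
    # S = [S0 + log2(wh / WH)], where [ ] denotes rounding: the network
    # shrinks for objects that are small relative to the target image.
    return round(S0 + math.log2((w * h) / (W * H)))

# Hypothetical example: a 96 x 64 detection box in a 1024 x 1024 image.
S = point_net_size(20, 96, 64, 1024, 1024)  # -> 13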
Further, the feature extraction model is a ResNet-FPN model.
Further, the features of the contour points and of the hypothetical deformation points around each contour point are obtained with the ResNet-FPN model as follows: first, the target image is input into the ResNet-FPN model to obtain several feature maps; then the features of each contour point and of each hypothetical deformation point are obtained from each feature map by interpolation.
Further, in order to adaptively select feature maps for target instance segmentation tasks of different sizes, the number of obtained feature maps PX is 3, the three adjacent feature maps being a first feature map P(K-1), a second feature map PK and a third feature map P(K+1); the smaller the value of X, the lower the convolution layer from which the corresponding feature map is obtained. The value of K is:
K=[K0+log2(wh/WH)]
where [ ] denotes rounding; w and h are the width and height of the detection box; W and H are the width and height of the target image; and K0 is the base level for an object whose area equals W × H.
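A companion sketch for the adaptive feature-map choice; the dictionary layout and the clamp that keeps P(K-1)..P(K+1) inside the candidate range [P2, P5] are assumptions, and K0 = 5 is the value used in the embodiment below:

import math

def select_feature_maps(pyramid, w, h, W, H, K0=5):
    # pyramid is assumed to be a dict such as {2: P2, 3: P3, 4: P4, 5: P5}.
    K = round(K0 + math.log2((w * h) / (W * H)))
    K = max(3, min(4, K))  # assumed clamp so P(K-1)..P(K+1) stay within P2..P5
    return [pyramid[K - 1], pyramid[K], pyramid[K + 1]]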
Further, in the step 1), the initial contour extraction model is a Mask R-CNN model.
Further, in step 4), the deformed contour extraction model is a graph convolutional neural network model.
Further, when the constructed initial contour extraction model, feature extraction model and deformed contour extraction model are trained, the loss function L is:

L_poi_n = SmoothL1(x_n^pred - x_n^gt)

L_con_n = (1/N) * sum_{i=1..N} SmoothL1(x_i^n - x_i)

L = L_poi_n + L_con_n + L_Mask_n

where SmoothL1 denotes the Smooth L1 loss function; L_poi_n denotes the loss function for each contour point, with x_n^pred the coordinates of the predicted point and x_n^gt the coordinates of the real point; L_con_n denotes the loss function for the entire contour, with N the total number of contour points, x_i^n the coordinates of the predicted contour points and x_i the coordinates of the real contour points; n is the iteration number; and L_Mask_n is the average binary cross-entropy loss of all pixels within the detection box.
Further, in order to ensure the best balance between efficiency and effect, the required number of repeated iterations is 3.
Drawings
FIG. 1 is a flow chart of the image instance segmentation method of the present invention;
FIG. 2 is a schematic view of hypothetical deformation point networks of different sizes according to the present invention;
FIG. 3 is a schematic diagram of the graph convolution deformation process of the present invention.
Detailed Description
The basic concept of the invention is as follows: to address the fact that prior-art methods are unsuitable for tasks with high fineness requirements, the invention deforms the contour on top of the segmentation result obtained by Mask R-CNN to obtain a more accurate segmentation. The deformation process is as follows. First, the contour is represented by a sufficient number of contour points, and N hypothetical deformation points are designed for each contour point; the N hypothetical deformation points and the corresponding contour point form a planar network. Then, after the features of these points are obtained, they are input as a local graph into a graph convolutional neural network (GCN) to obtain the weight of each point. Finally, a weighted summation over the points yields the deformation result of each contour point, and hence the final deformed contour.
The following describes the image example segmentation method of the present invention in detail with reference to the flowchart of fig. 1.
Step one, inputting a target image (Images) into a constructed Initial Contour extraction model (Mask R-CNN model) to obtain an Initial Contour (Initial Contour) of a target in the image.
Step two, sampling on the initial contour a number of contour points sufficient to represent it, and extracting N hypothetical deformation points around each contour point.
Here, "graph" refers to a topological graph in the mathematical sense, in which correspondences are established through vertices and edges; the essential purpose of a GCN is to extract the spatial features of such a topological graph and use them for learning. In this step, the invention designs a hypothetical deformation point network for each point sampled from the contour and treats that network as a graph. When choosing the hypothetical deformation point network, the number of vector directions must satisfy the deformation requirements while preserving local point density.
As shown in fig. 2, the hypothetical deformation point network (Local Point-Net) consists of 21 points forming a network with 48 edges and a definite topological relationship. The network offers 16 distinct vector directions, guaranteeing deformation in all directions, and is layered so that different deformation magnitudes can be accommodated. The hypothetical deformation point network is used to predict the direction and magnitude of the deformation of the sampled contour point, with the relative position of each point involved fixed. On this basis, the invention designs an adaptive scale strategy to fit contour deformations of objects of different sizes; the chosen size of the hypothetical deformation point network is:
S=[S0+log2(wh/WH)] (1)
where [ ] denotes rounding; S is the size of the hypothetical deformation point network; S0 is the base size for an object whose area equals W × H, set to 20 pixels in this embodiment; w and h are the width and height of the detection box; and W and H are the width and height of the target image.
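The exact geometry of the 21 points is not spelled out beyond the counts above (21 points, 48 edges, 16 directions, a layered structure), so the following layout is only one plausible arrangement: a sketch assuming a centre point, an inner ring of 4 points and an outer ring of 16 points, all scaled by S.

import math

def local_point_net(cx, cy, S):
    # Hypothetical 21-point layout: the sampled contour point itself,
    # an inner ring of 4 points at half scale (coarse directions, small
    # deformation), and an outer ring of 16 points at full scale S
    # (all 16 vector directions, large deformation).
    points = [(cx, cy)]
    for i in range(4):
        a = 2 * math.pi * i / 4
        points.append((cx + 0.5 * S * math.cos(a), cy + 0.5 * S * math.sin(a)))
    for i in range(16):
        a = 2 * math.pi * i / 16
        points.append((cx + S * math.cos(a), cy + S * math.sin(a)))
    return points  # 1 + 4 + 16 = 21 candidate positions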
Step three, after the contour points and the hypothetical deformation points of each contour point are obtained, the features of each contour point and of the hypothetical deformation points around it are obtained using the constructed feature extraction model (ResNet-FPN model). Specifically:
1. and (3) putting the target Images (Images) into a ResNet-FPN model to obtain 3 adjacent feature maps.
Mask R-CNN uses a ResNet-FPN model for feature extraction, and the same model is used in this embodiment. The FPN (feature pyramid network) is a carefully designed multi-scale detection method comprising bottom-up, top-down and lateral-connection parts; this structure fuses features of all levels so that they carry strong semantic information and strong spatial information simultaneously. As shown in fig. 1, in the bottom-up pass ResNet serves as the backbone network and is divided into 5 stages by feature-map size, with stage outputs ConV1, ConV2, ConV3, ConV4 and ConV5. The top-down pass upsamples the ConV feature maps by the nearest-neighbour method to obtain M2, M3, M4 and M5, while M6 is obtained by downsampling M5. The lateral-connection step fuses the M layers with the ConV layers and removes the aliasing effect of upsampling with a 3 × 3 convolution kernel, finally giving P2, P3, P4, P5 and P6. As seen in the figure, P2 is obtained by fusing ConV2 with M3 followed by the 3 × 3 convolution, and P3 by fusing ConV3 with M4 followed by the 3 × 3 convolution; that is, the smaller X is in feature map PX, the lower the convolution layer from which PX is obtained.
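A minimal PyTorch sketch of the top-down and lateral fusion just described; the channel widths assume a ResNet-50-style backbone, and the class and its defaults are illustrative rather than the patent's code:

import torch.nn.functional as F
from torch import nn

class TopDownFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1 x 1 lateral convolutions on ConV2..ConV5.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3 x 3 convolutions that suppress the aliasing of upsampling.
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, conv2, conv3, conv4, conv5):
        laterals = [l(c) for l, c in zip(self.lateral, (conv2, conv3, conv4, conv5))]
        m5 = laterals[3]
        m4 = laterals[2] + F.interpolate(m5, scale_factor=2, mode="nearest")
        m3 = laterals[1] + F.interpolate(m4, scale_factor=2, mode="nearest")
        m2 = laterals[0] + F.interpolate(m3, scale_factor=2, mode="nearest")
        p2, p3, p4, p5 = (s(m) for s, m in zip(self.smooth, (m2, m3, m4, m5)))
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)  # extra level by downsampling
        return p2, p3, p4, p5, p6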
With (H, W) the height and width of the target image, the feature maps P2, P3, P4, P5 and P6 have sizes (H/4, W/4), (H/8, W/8), (H/16, W/16), (H/32, W/32) and (H/64, W/64) respectively, each with 256 channels. [P2, P3, P4, P5] are taken as candidate feature maps, and maps are selected according to the size of the objects to be segmented in a given task. For small-target instance segmentation, higher-resolution feature maps [P2, P3, P4] should be selected to improve segmentation accuracy; conversely, for large-target instance segmentation, relatively lower-resolution feature maps [P3, P4, P5] may be selected to improve segmentation efficiency. In this embodiment, the K value computed by the following formula adaptively selects the combination [P(K-1), PK, P(K+1)]:
K=[K0+log2(wh/WH)] (2)
where [ ] denotes rounding; w and h are the width and height of the detection box; W and H are the width and height of the target image; and K0 is the base level for an object whose area equals W × H, set to 5 in this embodiment.
2. Because a feature map cannot contain all the points exactly (both contour points and the corresponding hypothetical deformation points), once the feature maps are selected, the feature of each point is obtained from each feature map by bilinear interpolation according to the coordinates of the contour points and hypothetical deformation points, and the features are then fused.
In a neural network, each layer loses some information as the network deepens; the feature pyramid idea in FPN, with its multi-scale, multi-layer feature fusion, retains more of it. Being simple and easy to apply, feature connection is widely used to reduce such loss. Since the previously selected feature maps P2-P5 all have 256 channels, the features of each point from the different maps are summed, keeping the channel count unchanged at 256. In summary, after feature extraction and feature connection the feature dimension of each contour point is 21 × 256.
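A sketch of this sampling step in PyTorch; the tensor shapes and the helper name are assumptions, but torch.nn.functional.grid_sample performs exactly the bilinear lookup described:

import torch
import torch.nn.functional as F

def point_features(feature_maps, points, image_hw):
    # feature_maps: list of 3 tensors, each (1, 256, Hf, Wf), the selected maps.
    # points: (P, 2) pixel coordinates (x, y) in the target image.
    # Returns (P, 256): per-point features summed over the 3 maps,
    # keeping the channel count at 256 as described above.
    H, W = image_hw
    gx = points[:, 0] / (W - 1) * 2 - 1  # normalise x to [-1, 1]
    gy = points[:, 1] / (H - 1) * 2 - 1  # normalise y to [-1, 1]
    grid = torch.stack([gx, gy], dim=-1).view(1, 1, -1, 2)
    feats = 0
    for fmap in feature_maps:
        sampled = F.grid_sample(fmap, grid, mode="bilinear", align_corners=True)
        feats = feats + sampled[0, :, 0, :].t()  # (P, 256)
    return feats

For 128 contour points, each with a 21-point network, P = 128 × 21 and the result can be reshaped to (128, 21, 256), matching the feature dimension stated above.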
Step four, the obtained features of each contour point and of the hypothetical deformation points around it are input into the constructed deformed contour extraction model (GCN model) to obtain the deformation result (Refine Mask) of each contour point, and hence the deformed contour of the target in the image, thereby refining the segmentation edge of the initial contour. Specifically:
After the features corresponding to the contour points and hypothetical deformation points are obtained, these features must be used to infer the deformation information of each contour point from its hypothetical deformation points. As illustrated in fig. 3, 128 contour points are chosen for each initial mask so that the shape of most objects can be represented; the coordinate input therefore has dimension 128 × 21 × 2. Once obtained, the features are input into the GCN to compute the weights of the individual hypothetical deformation points.
In this embodiment, a scoring graph neural network consisting of 5 graph convolution layers with ReLU is used to predict the weight of each hypothetical deformation point. The deformation vector of the contour point is then obtained through a Softmax layer and a weighted summation, giving the deformation result. By stacking several convolution layers, the GCN propagates information across multi-order neighbourhoods and makes full use of the topology of the local deformation point network.
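A hedged sketch of this scoring network; the patent does not give the exact graph-convolution layer, so the standard form X' = ReLU(A_hat X W) over the 21-node local graph is an assumption, as are the class and argument names:

import torch
from torch import nn

class DeformScorer(nn.Module):
    def __init__(self, a_hat, dim=256, layers=5):
        super().__init__()
        # a_hat: (21, 21) normalised adjacency matrix of the local point
        # network (its 48-edge topology), assumed precomputed.
        self.a_hat = a_hat
        self.convs = nn.ModuleList(nn.Linear(dim, dim) for _ in range(layers))
        self.score = nn.Linear(dim, 1)

    def forward(self, x, offsets):
        # x: (B, 21, 256) point features; offsets: (B, 21, 2) candidate
        # vectors from the contour point to each hypothetical point.
        for conv in self.convs:
            x = torch.relu(self.a_hat @ conv(x))  # message passing on the local graph
        w = torch.softmax(self.score(x).squeeze(-1), dim=-1)  # (B, 21) weights
        return (w.unsqueeze(-1) * offsets).sum(dim=1)  # (B, 2) deformation vector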
In addition, since a single deformation rarely achieves a good result and its quality is poor, this embodiment obtains the final deformation result through multiple iterations of deformation in an iterative optimization manner. That is, the deformed contour of the target obtained in the first iteration serves as the initial contour of the second iteration; steps two to four are repeated to obtain the second iteration's deformation result, which is taken as the final result. Note that "quality" here refers to the fineness of the segmentation: what the method improves is the refinement of the edge, i.e., how closely the segmentation result follows the target at its edges. The quality metric used is the standard Mask AP value, where AP is the mean of AP50, AP55, AP60, ..., AP95 (the number denoting the IoU threshold), the standard evaluation metric of the COCO dataset.
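The iteration logic itself is simple; a sketch under the assumption that deform_once bundles steps two to four, with the stopping rule mirroring claim 1 (stop when quality is met or the iteration budget is spent):

def refine(initial_contour, deform_once, max_iters=2, quality_ok=lambda c: False):
    # max_iters = 2 in this embodiment; claim 9 allows up to 3.
    contour = initial_contour
    for _ in range(max_iters):
        contour = deform_once(contour)  # steps two to four on the current contour
        if quality_ok(contour):
            break
    return contour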
It should be noted that the initial contour extraction model, the feature extraction model and the deformed contour extraction model are trained together. Two loss functions are used in training: one is the Smooth L1 loss, and the other is the Mask loss function inherited from Mask R-CNN.
For each contour point, the loss function is defined as:

L_poi_n = SmoothL1(x_n^pred - x_n^gt)

where n is the iteration number, x_n^pred are the coordinates of the predicted point, and x_n^gt are the coordinates of the real point.
Summed over the iterative process, the loss function for the entire contour is defined as:

L_con_n = (1/N) * sum_{i=1..N} SmoothL1(x_i^n - x_i)

where N is the total number of contour points and the x_i^n are the coordinates resampled from the deformed contour of iteration n, x_i being the coordinates of the real contour points.
L_Mask, inherited from Mask R-CNN, is the average binary cross-entropy loss of all pixels within the detection box.
Finally, the multitask loss function is defined as:

L = L_poi_n + L_con_n + L_Mask_n
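A sketch of the three terms with PyTorch's built-in losses; the tensor shapes are illustrative and the unweighted sum is an assumption, since the text does not state relative weights:

import torch.nn.functional as F

def total_loss(pred_point, gt_point, pred_contour, gt_contour, mask_logits, mask_gt):
    # L_poi_n: Smooth L1 between one predicted point and its real point.
    l_poi = F.smooth_l1_loss(pred_point, gt_point)
    # L_con_n: Smooth L1 averaged over all N resampled contour points (N, 2).
    l_con = F.smooth_l1_loss(pred_contour, gt_contour)
    # L_Mask_n: average binary cross-entropy over pixels in the detection box.
    l_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_gt)
    return l_poi + l_con + l_mask  # assumed unweighted combination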
in this embodiment, the initial contour extraction model is Mask R-CNN model, and the feature extraction model is ResNet-FPN model. As other embodiments, other models in the prior art may be selected for performing corresponding initial contour extraction and feature extraction operations, for example, the initial contour extraction model may be a PointRend model, a Cascade Mask R-CNN model, or the like, and the feature extraction model may be a resenext model, or the like.
In this embodiment, two iterations are performed and the final deformation result is taken as the deformed contour of the target in the image. In another embodiment, if one iteration already gives a good result, the second iteration may be skipped to improve computational efficiency; conversely, if two iterations are still insufficient, a third may be performed, at some cost in computational efficiency.
The invention has the following characteristics:
1. The method extracts N hypothetical deformation points around each contour point, extracts the features of the contour point and of its N hypothetical deformation points, and inputs them into the constructed deformed contour extraction model to obtain the deformation result of each contour point and hence the deformed contour of the target in the image. By deforming the contour on top of the segmentation result produced by the initial contour extraction model, a finer segmentation is obtained and the edge contour accuracy is improved, making the method suitable for tasks with stringent precision requirements.
2. The initial contour extraction model is a Mask R-CNN model and the feature extraction model is a ResNet-FPN model, so each point inherits the features at the corresponding position of the feature maps in the Mask R-CNN network architecture, allowing the deformation result to better fit the initial contour.
3. Since the deformation between the initial contour and the real contour differs for target objects of different sizes, the hypothetical deformation point network is designed with an adaptive scale (see formula (1)), so its size adjusts adaptively to the size of the target image and of the target object, fitting the initial contour better.
4. Feature maps are chosen with an automatic feature-layer combination selection method (see formula (2)), so the combination adapts to the size of the target image and of the target object: shallow layers are selected for small-target instance segmentation to improve accuracy, and deep layers for large-target instance segmentation to improve efficiency.
5. An iterative method is used, obtaining a more refined segmentation result through multiple deformations.

Claims (9)

1. An image instance segmentation method is characterized by comprising the following steps:
1) inputting a target image into the constructed initial contour extraction model to obtain an initial contour of a target in the image; sampling a plurality of contour points on the initial contour;
2) extracting a plurality of hypothetical deformation points around each contour point;
3) obtaining the features of each contour point and the features of the hypothetical deformation points around each contour point by using the constructed feature extraction model;
4) inputting the obtained features of each contour point and of the hypothetical deformation points around each contour point into the constructed deformed contour extraction model to obtain the deformation result of each contour point and hence the deformed contour of the target in the image; judging whether the quality of the deformed contour meets requirements; and if not, sampling a plurality of contour points on the deformed contour and repeating steps 2) to 4) until the quality of the resulting deformed contour meets the requirements or the number of repeated iterations reaches the required value.
2. The image instance segmentation method according to claim 1, wherein a contour point and the plurality of hypothetical deformation points extracted around it form a hypothetical deformation point network, the size of which is:
S=[S0+log2(wh/WH)]
where [ ] denotes rounding; S is the size of the hypothetical deformation point network; S0 is the base size for an object whose area equals W × H; w and h are the width and height of the detection box; and W and H are the width and height of the target image.
3. The image instance segmentation method according to claim 1, wherein the feature extraction model is the ResNet-FPN model.
4. The image instance segmentation method according to claim 3, wherein the features of the contour points and of the hypothetical deformation points around each contour point are obtained with the ResNet-FPN model as follows: first, the target image is input into the ResNet-FPN model to obtain a plurality of feature maps; then the features of each contour point and of each hypothetical deformation point are obtained from each feature map by interpolation.
5. The image instance segmentation method according to claim 4, wherein the number of obtained feature maps PX is 3, the three adjacent feature maps being a first feature map P(K-1), a second feature map PK and a third feature map P(K+1); wherein the smaller the value of X, the lower the convolution layer from which the corresponding feature map is obtained; and the value of K is:
K=[K0+log2(wh/WH)]
where [ ] denotes rounding; w and h are the width and height of the detection box; W and H are the width and height of the target image; and K0 is the base level for an object whose area equals W × H.
6. The image instance segmentation method according to claim 1, wherein in step 1), the initial contour extraction model is a Mask R-CNN model.
7. The image instance segmentation method according to claim 1, wherein in step 4), the deformed contour extraction model is a graph convolutional neural network model.
8. The image instance segmentation method according to any one of claims 1 to 7, wherein the loss function L adopted in training the constructed initial contour extraction model, feature extraction model and deformed contour extraction model is:

L_poi_n = SmoothL1(x_n^pred - x_n^gt)

L_con_n = (1/N) * sum_{i=1..N} SmoothL1(x_i^n - x_i)

L = L_poi_n + L_con_n + L_Mask_n

where SmoothL1 denotes the Smooth L1 loss function; L_poi_n denotes the loss function for each contour point, with x_n^pred the coordinates of the predicted point and x_n^gt the coordinates of the real point; L_con_n denotes the loss function for the entire contour, with N the total number of contour points, x_i^n the coordinates of the predicted contour points and x_i the coordinates of the real contour points; n is the number of iterations; and L_Mask_n is the average binary cross-entropy loss of all pixels within the detection box.
9. The image instance segmentation method according to claim 1, wherein the required number of repeated iterations is 3.
CN202110342899.5A 2021-03-30 2021-03-30 Image instance segmentation method Active CN113077484B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110342899.5A CN113077484B (en) 2021-03-30 2021-03-30 Image instance segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110342899.5A CN113077484B (en) 2021-03-30 2021-03-30 Image instance segmentation method

Publications (2)

Publication Number Publication Date
CN113077484A true CN113077484A (en) 2021-07-06
CN113077484B CN113077484B (en) 2023-05-23

Family

ID=76612002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110342899.5A Active CN113077484B (en) 2021-03-30 2021-03-30 Image instance segmentation method

Country Status (1)

Country Link
CN (1) CN113077484B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140270540A1 (en) * 2013-03-13 2014-09-18 Mecommerce, Inc. Determining dimension of target object in an image using reference object
CN106875365A (en) * 2017-03-01 2017-06-20 淮安信息职业技术学院 Spill image partition method based on GSA
WO2019180745A1 (en) * 2018-03-21 2019-09-26 Karade Vikas Systems and methods for obtaining 3-d images from x-ray information for deformed elongate bones
CN110148148A (en) * 2019-03-01 2019-08-20 北京纵目安驰智能科技有限公司 A kind of training method, model and the storage medium of the lower edge detection model based on target detection
CN110569857A (en) * 2019-07-28 2019-12-13 景德镇陶瓷大学 image contour corner detection method based on centroid distance calculation
CN111126389A (en) * 2019-12-20 2020-05-08 腾讯科技(深圳)有限公司 Text detection method and device, electronic equipment and storage medium
CN111340784A (en) * 2020-02-25 2020-06-26 安徽大学 Image tampering detection method based on Mask R-CNN
CN111723845A (en) * 2020-05-19 2020-09-29 浙江工业大学 Cell image segmentation method based on Mask contour
CN112001403A (en) * 2020-08-11 2020-11-27 北京化工大学 Image contour detection method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HUAN LING et al.: "Fast Interactive Object Annotation with Curve-GCN", arXiv *
SIDA PENG et al.: "Deep Snake for Real-Time Instance Segmentation", arXiv *
张继凯 et al.: "A survey of image instance segmentation methods based on deep learning" (深度学习的图像实例分割方法综述), Journal of Chinese Computer Systems (小型微型计算机系统) *
李学龙 et al.: "Pixel-level semantic understanding: from classification to regression" (像素级语义理解: 从分类到回归), SCIENTIA SINICA Informationis (中国科学: 信息科学) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822905A (en) * 2021-09-09 2021-12-21 辽宁工程技术大学 Improved example segmentation method based on target contour
CN113822905B (en) * 2021-09-09 2024-03-29 辽宁工程技术大学 Improved object contour-based instance segmentation method

Also Published As

Publication number Publication date
CN113077484B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN111598174B (en) Model training method based on semi-supervised antagonistic learning and image change analysis method
Li et al. Multiscale features supported DeepLabV3+ optimization scheme for accurate water semantic segmentation
CN105701508B (en) Global local optimum model and conspicuousness detection algorithm based on multistage convolutional neural networks
Zhou et al. BOMSC-Net: Boundary optimization and multi-scale context awareness based building extraction from high-resolution remote sensing imagery
CN111310861A (en) License plate recognition and positioning method based on deep neural network
CN103049763B (en) Context-constraint-based target identification method
CN111079739B (en) Multi-scale attention feature detection method
CN112488025B (en) Double-temporal remote sensing image semantic change detection method based on multi-modal feature fusion
CN110751195B (en) Fine-grained image classification method based on improved YOLOv3
CN113177456B (en) Remote sensing target detection method based on single-stage full convolution network and multi-feature fusion
CN114627052A (en) Infrared image air leakage and liquid leakage detection method and system based on deep learning
CN111738055A (en) Multi-class text detection system and bill form detection method based on same
CN111461213A (en) Training method of target detection model and target rapid detection method
CN114782734A (en) Visual detection method for pipeline leakage of valve cooling system, computer and storage medium
CN105405138A (en) Water surface target tracking method based on saliency detection
CN111915628A (en) Single-stage instance segmentation method based on prediction target dense boundary points
Liu et al. Image edge recognition of virtual reality scene based on multi-operator dynamic weight detection
CN113920468A (en) Multi-branch pedestrian detection method based on cross-scale feature enhancement
Zuo et al. A remote sensing image semantic segmentation method by combining deformable convolution with conditional random fields
Bobyr et al. A method for creating a depth map based on a three-level fuzzy model
CN113888505B (en) Natural scene text detection method based on semantic segmentation
CN113077484A (en) Image instance segmentation method
CN111160372B (en) Large target identification method based on high-speed convolutional neural network
CN115984712A (en) Multi-scale feature-based remote sensing image small target detection method and system
CN114494999B (en) Double-branch combined target intensive prediction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant