CN115100410A - Real-time instance segmentation method integrating sparse framework and spatial attention - Google Patents

Real-time instance segmentation method integrating sparse framework and spatial attention Download PDF

Info

Publication number
CN115100410A
Authority
CN
China
Prior art keywords
mask
feature
feature map
target
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210803057.XA
Other languages
Chinese (zh)
Inventor
刘盛
陈俊皓
张峰
郭炳男
陈瑞祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202210803057.XA priority Critical patent/CN115100410A/en
Publication of CN115100410A publication Critical patent/CN115100410A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a real-time instance segmentation method fusing a sparse framework and spatial attention. The method first acquires an image to be processed, inputs it into a feature extraction network to extract a multi-scale feature map, and inputs the multi-scale feature map into a feature enhancement network to obtain an enhanced feature map. The enhanced feature map is input into an instance branch to obtain target boxes and target classifications; the enhanced feature map is then concatenated with the target boxes output by the instance branch and input into a mask branch, where a convolution operation is performed, followed by a spatial attention module and a mask kernel generation module that respectively produce a spatial attention feature and a mask kernel, which are multiplied to obtain a segmentation mask. Finally, the segmentation mask, the target boxes and the target classifications are mapped onto the image to be processed to obtain the instance segmentation result. The invention improves both the speed and the accuracy of the instance segmentation task, and can produce real-time, accurate segmentation results when continuous video frames are input.

Description

Real-time instance segmentation method integrating sparse framework and spatial attention
Technical Field
The application belongs to the technical field of instance segmentation, and particularly relates to a real-time instance segmentation method fusing a sparse framework and spatial attention.
Background
Instance segmentation has become one of the most important, complex and challenging areas in machine vision research. It underpins many downstream tasks, including crowd detection, autonomous driving and video surveillance.
The purpose of object detection is to detect the category of a target in an image and give its position in the form of a bounding box or a center point; semantic segmentation aims to predict the label of every pixel in the image to obtain an accurate inference result, classifying and labelling each pixel according to the object or region it belongs to. Instance segmentation segments the objects in the image (generates a mask) and assigns different labels to different object instances of the same object class. Instance segmentation can therefore be regarded as solving object detection and semantic segmentation simultaneously, decomposing each segmented object into its respective sub-components.
One classic and effective instance segmentation idea in the prior art is based on proposal-box generation: a proposal box is used to detect an object, and segmentation is then performed inside the proposal box. Three common types of proposal-box detectors are dense detectors, dense-to-sparse detectors and sparse detectors, and common instance segmentation models are Mask RCNN, PolarMask and Sparse RCNN.
Mask RCNN is a typical top-down detection method that adds a new branch at the end of the dense-to-sparse detector Faster RCNN and segments the features after ROI Align (a feature alignment method). It follows the detect-then-segment paradigm: the model generates dense prior target boxes with an RPN, and a mask segmentation head then separates foreground from background inside each detected target box. This approach effectively eliminates the influence of the background on segmentation, but it is sensitive to detector performance and requires the detection step to be good enough.
PolarMask improves on the dense detector FCOS and unifies instance segmentation into a fully convolutional form, i.e. it does not use the non-convolutional ROI Align operation; PolarMask models a mask not as a binary image but as an image composed of several rays in a polar coordinate system, thereby realizing a single-stage instance segmentation framework.
The greatest characteristic of Sparse RCNN is the sparsity of the whole target detection pipeline: its input is a set of sparse proposal boxes and proposal features with one-to-one interaction. The whole detection process involves no dense proposal regions and no dense (global) features; it avoids designing a large number of prior boxes and the many-to-one mapping between prior boxes and ground-truth boxes, performs better than dense proposal-box generation algorithms, requires no post-processing such as non-maximum suppression, and realizes a fully end-to-end object detector.
Most previous object detectors are dense detectors, such as Mask RCNN: they are built on dense proposals preset on the image grid or feature-map grid, predict scores and offsets for these proposals, judge them by IoU (intersection over union) and filter them with NMS (non-maximum suppression). A smaller number are dense-to-sparse detectors, such as PolarMask, which first extract a few (sparse) foreground boxes, i.e. region candidate boxes, from the dense proposal regions, and then classify each candidate box and regress its position, narrowing thousands of candidates down to a few foreground objects. Both kinds of methods are time-consuming, because each candidate box is passed through the convolutional neural network individually and there are tedious prior-box design and post-processing operations.
Disclosure of Invention
The purpose of the application is to provide a real-time instance segmentation method fusing a sparse framework and spatial attention that avoids prior-box design and a large number of complex post-processing operations such as non-maximum suppression, and achieves fast inference.
In order to achieve this purpose, the technical solution of the application is as follows:
a real-time instance segmentation method fusing a sparse framework and spatial attention performs instance segmentation on an image with a constructed instance segmentation network comprising a feature extraction network, a feature enhancement network, a mask branch and an instance branch, and comprises the following steps:
acquiring an image to be processed and inputting it into the feature extraction network to extract a multi-scale feature map;
inputting the multi-scale feature map into the feature enhancement network to obtain an enhanced feature map;
inputting the enhanced feature map into the instance branch to obtain target boxes and target classifications;
concatenating the enhanced feature map with the target boxes output by the instance branch and inputting the result into the mask branch, first performing a convolution operation, then passing the result through a spatial attention module and a mask kernel generation module to obtain a spatial attention feature and a mask kernel respectively, and multiplying the two to obtain a segmentation mask;
and mapping the segmentation mask, the target boxes and the target classifications onto the image to be processed to obtain the instance segmentation result.
Further, the feature extraction network adopts ResNet50, taking the outputs of residual modules 3, 4 and 5 in ResNet50 as the extracted third-scale feature map, second-scale feature map and first-scale feature map respectively;
inputting the multi-scale feature map into the feature enhancement network to obtain an enhanced feature map comprises:
inputting the first-scale feature map into a pyramid pooling module and outputting a first feature map;
adding the second-scale feature map and the first feature map element-wise to obtain a second feature map;
adding the third-scale feature map and the second feature map element-wise to obtain a third feature map;
and performing a convolution operation on the first, second and third feature maps respectively and then concatenating them to obtain the enhanced feature map.
Further, inputting the enhanced feature map into the instance branch to obtain target boxes and target classifications comprises:
initializing proposal boxes and proposal features;
extracting features of interest from the enhanced feature map by region feature focusing;
and inputting the features of interest and the proposal features into a dynamic convolution head to generate the final target boxes and target classifications.
Further, the spatial attention module performs the following operations:
performing maximum pooling and average pooling on the input feature map along the channel dimension to obtain pooled features P_max and P_avg, taking the dot product of P_max and P_avg, applying a 3×3 convolution followed by a sigmoid operation, and multiplying the result with the original input feature map to obtain the spatial attention feature.
Further, the mask kernel generation module performs the following operations:
dimension is adjusted through linear layers on the input feature graph, and a 128-dimensional mask kernel is generated.
Further, when training the instance segmentation network, inputting the enhanced feature map into the instance branch to obtain target boxes and target classifications further includes:
and introducing an intersection ratio IoU to adjust the target classification probability, wherein the formula is as follows:
Figure BDA0003735040430000041
wherein the content of the first and second substances,
Figure BDA0003735040430000042
for the adjusted target classification probability, S i Denotes the intersection ratio IoU, P i And representing the classification probability of the target corresponding to the target box i.
Further, when training the instance segmentation network, the overall loss function of the network is as follows:
$$\mathcal{L} = \lambda_{cls}\,\mathcal{L}_{cls} + \mathcal{L}_{mask} + \mathcal{L}_{box}$$

where $\mathcal{L}_{cls}$ denotes the target classification loss, $\mathcal{L}_{mask}$ denotes the mask loss, $\mathcal{L}_{box}$ denotes the target box loss, and $\lambda_{cls}$ is a weight coefficient;

$$\mathcal{L}_{mask} = \lambda_{dice}\,\mathcal{L}_{dice} + \lambda_{pix}\,\mathcal{L}_{pix}$$

where $\mathcal{L}_{dice}$ and $\mathcal{L}_{pix}$ are the dice loss and the pixel-level binary cross-entropy loss, and $\lambda_{dice}$ and $\lambda_{pix}$ are the corresponding weight coefficients.
The application improves the existing feature extraction and enhancement methods and the efficiency of that stage. Based on a purely sparse target detection algorithm, it avoids the design of a large number of prior boxes and the many-to-one mapping between prior boxes and ground-truth boxes, needs no complex post-processing such as non-maximum suppression, and improves the network's attention to instance features by fusing a spatial attention module. The method improves both the speed and the accuracy of the instance segmentation task and can produce real-time, accurate segmentation results when continuous video frames are input.
Drawings
FIG. 1 is a flow chart of an example segmentation method of the present application;
FIG. 2 is a schematic diagram of the instance segmentation network architecture of the present application;
FIG. 3 is a schematic diagram of a pyramid pooling module;
FIG. 4 is a schematic diagram of a dynamic convolution head;
fig. 5 is a schematic view of a spatial attention module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a real-time instance segmentation method fusing a sparse framework and spatial attention is provided. An image is segmented into instances using a constructed instance segmentation network comprising a feature extraction network, a feature enhancement network, a mask branch and an instance branch. The real-time instance segmentation method fusing a sparse framework and spatial attention comprises the following steps.
Step S1: acquiring the image to be processed, inputting it into the feature extraction network, and extracting a multi-scale feature map.
In this embodiment, an image to be processed, Image ∈ R^(3×H×W), is given, where 3 is the number of RGB channels and H and W denote the height and width of the image; the image is first preprocessed.
The preprocessing differs between network model training and practical application. During training, to improve the generalization of the model, the pictures are augmented: each picture is first flipped left-right with a probability of 0.5, which doubles the amount of training data, and is then randomly cropped and scaled to a fixed size. Because the input pixel values lie in the range 0 to 255 and training is unstable in this range, the pixel values are scaled proportionally to the range 0 to 1, so the preprocessing further normalizes the pixel values of the image.
In practical application after the network model has been trained, no data augmentation is applied to the pictures; the inference input only needs to be consistent with that used during training. Specifically, the image to be processed only needs to be scaled and normalized (no random flipping or cropping).
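For illustration, a minimal preprocessing sketch in PyTorch/torchvision is given below; the crop size, input resolution, flip probability and normalization statistics are illustrative assumptions rather than values fixed by this embodiment:

    import torchvision.transforms as T

    # Training-time augmentation: random horizontal flip, random crop, resize, normalize.
    train_transform = T.Compose([
        T.RandomHorizontalFlip(p=0.5),          # doubles the effective amount of training data
        T.RandomCrop(512, pad_if_needed=True),  # random crop (illustrative size)
        T.Resize((640, 640)),                   # scale to a fixed input size
        T.ToTensor(),                           # maps pixel values from [0, 255] to [0, 1]
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    # Inference: only scaling and normalization, no random flipping or cropping.
    test_transform = T.Compose([
        T.Resize((640, 640)),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])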
This embodiment adopts ResNet50 as the feature extraction network (also referred to as the backbone) and inputs the preprocessed pictures into ResNet50 to extract image features. The feature extraction network outputs multi-scale feature maps (res3, res4 and res5), i.e. a series of feature maps of different sizes F_i ∈ R^(C_i×H_i×W_i), where the channel numbers C_i are (512, 1024, 2048) in order, the heights H_i correspond to 1/8, 1/16 and 1/32 of the original image height, and the widths W_i are scaled in the same proportion as the heights. The feature maps obtained here are taken as the input of the next part.
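As a concrete illustration, the multi-scale features can be taken from a standard ResNet50 as sketched below (assuming torchvision's ResNet50, where res3, res4 and res5 correspond to the outputs of layer2, layer3 and layer4 with 512, 1024 and 2048 channels at strides 8, 16 and 32):

    import torch
    import torchvision

    backbone = torchvision.models.resnet50()   # pretrained weights may be loaded if desired

    def extract_features(x):
        x = backbone.conv1(x)
        x = backbone.bn1(x)
        x = backbone.relu(x)
        x = backbone.maxpool(x)        # stride 4
        c2 = backbone.layer1(x)        # stride 4,  256 channels
        res3 = backbone.layer2(c2)     # stride 8,  512 channels  (third-scale feature map)
        res4 = backbone.layer3(res3)   # stride 16, 1024 channels (second-scale feature map)
        res5 = backbone.layer4(res4)   # stride 32, 2048 channels (first-scale feature map)
        return res3, res4, res5

    res3, res4, res5 = extract_features(torch.randn(1, 3, 640, 640))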
Step S2: inputting the multi-scale feature map into the feature enhancement network to obtain an enhanced feature map.
In this step, the multi-scale feature map is passed into the feature enhancement network, also called the neck network, to obtain an enhanced feature map.
In a specific embodiment, the feature enhancement network performs the following operations:
inputting the first-scale feature map into a pyramid pooling module and outputting a first feature map;
adding the second-scale feature map and the first feature map element-wise to obtain a second feature map;
adding the third-scale feature map and the second feature map element-wise to obtain a third feature map;
and performing a convolution operation on the first, second and third feature maps respectively and then concatenating them to obtain the enhanced feature map.
As shown in fig. 2, the input image to be processed passes through ResNet50, and the outputs res3, res4 and res5 of residual modules 3, 4 and 5 of ResNet50 are referred to as the third-scale feature map, the second-scale feature map and the first-scale feature map respectively; they are then input into the feature enhancement network (neck). In the feature enhancement network, res5 is input into a pyramid pooling module (PPM) to obtain the first feature map, res4 is added to the first feature map to obtain the second feature map, and res3 is added to the second feature map to obtain the third feature map; the first, second and third feature maps are then each passed through a convolution operation and concatenated to obtain the enhanced feature map.
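A minimal PyTorch-style sketch of this feature enhancement network is given below. The 1×1 lateral convolutions, the common channel width of 256 and the bilinear upsampling used to make the element-wise additions and the final concatenation shape-compatible are implementation assumptions that the text does not spell out; PyramidPoolingModule refers to the sketch after the pyramid pooling description below.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureEnhancementNeck(nn.Module):
        def __init__(self, in_channels=(512, 1024, 2048), width=256):
            super().__init__()
            self.lat3 = nn.Conv2d(in_channels[0], width, 1)   # res3 -> common width
            self.lat4 = nn.Conv2d(in_channels[1], width, 1)   # res4 -> common width
            self.lat5 = nn.Conv2d(in_channels[2], width, 1)   # res5 -> common width
            self.ppm = PyramidPoolingModule(width)            # see the PPM sketch below
            self.out_convs = nn.ModuleList(nn.Conv2d(width, width, 3, padding=1) for _ in range(3))

        def forward(self, res3, res4, res5):
            f1 = self.ppm(self.lat5(res5))                                        # first feature map
            f2 = self.lat4(res4) + F.interpolate(f1, size=res4.shape[-2:],
                                                 mode="bilinear", align_corners=False)
            f3 = self.lat3(res3) + F.interpolate(f2, size=res3.shape[-2:],
                                                 mode="bilinear", align_corners=False)
            outs = [conv(f) for conv, f in zip(self.out_convs, (f1, f2, f3))]     # per-map convolution
            outs = [F.interpolate(f, size=res3.shape[-2:],
                                  mode="bilinear", align_corners=False) for f in outs]
            return torch.cat(outs, dim=1)                                         # enhanced feature map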
Inputting res5 into the pyramid pooling module (PPM) to obtain the first feature map enlarges the receptive field. As shown in fig. 3, the input features of the pyramid pooling module pass through 4 pooling levels (pool) whose output sizes are 1×1, 2×2, 3×3 and 6×6 respectively. The feature map is pooled to each target size, and a 1×1 convolution (conv) is applied to each pooled result to reduce its channels to 1/N of the original, where N = 4. Each of these feature maps is then upsampled by bilinear interpolation (upsample) to the size of the original feature map, and the original feature map and the upsampled feature maps are concatenated along the channel dimension (concat). The resulting number of channels is twice that of the original feature map, and a final 1×1 convolution reduces the channels back to the original number, so that the final feature map has the same size and number of channels as the original feature map.
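The pyramid pooling module described above can be sketched as follows (a minimal PyTorch version; average pooling is assumed for the pooling levels):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PyramidPoolingModule(nn.Module):
        # Pool to 1x1, 2x2, 3x3 and 6x6, reduce each branch to C/4 channels with a 1x1 conv,
        # upsample back, concatenate with the input (2C channels) and project back to C channels.
        def __init__(self, channels, bins=(1, 2, 3, 6)):
            super().__init__()
            self.bins = bins
            self.branches = nn.ModuleList(
                nn.Conv2d(channels, channels // len(bins), 1) for _ in bins
            )
            self.project = nn.Conv2d(channels * 2, channels, 1)

        def forward(self, x):
            h, w = x.shape[-2:]
            outs = [x]
            for bin_size, conv in zip(self.bins, self.branches):
                y = F.adaptive_avg_pool2d(x, bin_size)      # pool to bin_size x bin_size
                y = conv(y)                                  # reduce channels to C/4
                y = F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False)
                outs.append(y)
            return self.project(torch.cat(outs, dim=1))      # back to C channels, same size as input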
Step S3: inputting the enhanced feature map into the instance branch to obtain target boxes and target classifications.
The instance branch performs the following specific operations.
Step S31: initializing proposal boxes and proposal features.
Before the enhanced feature map is input into the instance branch, a set of learnable parameters of size N×4 representing the proposal boxes init_bboxes is initialized, where N is a hyper-parameter denoting the number of initial proposal boxes, and a set of learnable proposal features of size N×d corresponding to the proposal boxes is initialized. That is, N proposal boxes and N corresponding proposal features are obtained by initialization.
Step S32: extracting features of interest from the enhanced feature map by region feature focusing.
This embodiment extracts the features of interest roi_features from the enhanced feature map P by region feature focusing (ROI Align). Since N proposal boxes are initialized, N features of interest are obtained. ROI Align is a well-established technique in the art and is not described in detail here.
Step S33: inputting the features of interest and the proposal features into a dynamic convolution head to generate the final target boxes and target classifications.
In this embodiment, as shown in fig. 4, a dynamic instance interactive head executes k dynamic convolution layers sequentially, where k is a hyper-parameter, i.e. the preset number of dynamic convolution iterations.
The inputs of the dynamic convolution head are the features of interest roi_features and the proposal features proposal_features. This can be regarded as an attention mechanism: the proposal features generate the parameters of a convolution kernel, which then acts on the features of interest to obtain the final prediction result.
The output features and output proposal boxes of each dynamic convolution layer serve as the proposal features and proposal boxes of the next layer, and the final target boxes and target classifications are generated through multiple iterations.
It should be noted that, in this embodiment, each feature of interest and its corresponding proposal feature are passed through the dynamic convolution in turn, so that all the target boxes and target classifications are obtained.
In this embodiment, a set of learnable parameters of size N×4 is initialized to represent the proposal boxes; the learned proposal boxes can be understood as statistics of the positions where objects are likely to appear in an image. A set of learnable proposal features of size N×d, corresponding one-to-one to the proposal boxes, is initialized to characterize the target features. ROI Align provides the features of interest; it avoids losing information from the original feature map, and the absence of quantization throughout the intermediate process guarantees maximal information integrity. The obtained features are then sent to the dynamic convolution head, whose inputs are the features of interest and the proposal features. Each feature of interest is associated with one proposal feature to obtain the final output feature. The proposal feature can be viewed as an attention mechanism: it generates the parameters (params) of a convolution kernel, which is then applied to the feature of interest to obtain the final prediction result. The output features and output proposal boxes of the previous dynamic convolution serve as the proposal features and proposal boxes of the next dynamic convolution, and the final target boxes and target classifications are generated through multiple iterations. Dynamic convolution heads are well established in the art and are not described in detail here.
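A simplified sketch of one dynamic-interaction layer is given below; the feature dimension d = 256, the ROI size and the single d×d dynamic kernel per proposal are assumptions chosen to illustrate how a proposal feature generates convolution parameters that act on its region-of-interest feature.

    import torch
    import torch.nn as nn

    class DynamicConvLayer(nn.Module):
        def __init__(self, d=256, num_classes=80):
            super().__init__()
            self.param_gen = nn.Linear(d, d * d)        # proposal feature -> dynamic 1x1 kernel
            self.norm = nn.LayerNorm(d)
            self.cls_head = nn.Linear(d, num_classes)   # target classification
            self.reg_head = nn.Linear(d, 4)             # target box refinement (deltas)

        def forward(self, roi_features, proposal_features):
            # roi_features: (N, d, S, S) from ROI Align; proposal_features: (N, d)
            n, d, s, _ = roi_features.shape
            roi = roi_features.flatten(2).permute(0, 2, 1)              # (N, S*S, d)
            kernels = self.param_gen(proposal_features).view(n, d, d)   # one d x d kernel per proposal
            x = torch.relu(self.norm(torch.bmm(roi, kernels)))          # dynamic 1x1 conv as batched matmul
            x = x.mean(dim=1)                                           # (N, d) pooled object feature
            return self.cls_head(x), self.reg_head(x), x                # cls, box deltas, next proposal feature

Stacking k such layers, with the refined boxes and output features fed to the next layer, yields the iterative behaviour described above.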
Step S4: concatenating the enhanced feature map with the target boxes output by the instance branch and inputting the result into the mask branch, first performing a convolution operation, then passing the result through a spatial attention module and a mask kernel generation module respectively to obtain a spatial attention feature and a mask kernel, and multiplying the two to obtain a segmentation mask.
In this embodiment, the enhanced feature map is concatenated with the target boxes output by the instance branch and input into the mask branch, where a convolution operation is performed first. That is, before the enhanced feature map is input into the mask branch, the spatiality of the feature map is increased; the normalized coordinate information added in the first convolution can be expressed by the following formula:
mask_features = mask_convs(cat(coord_features, P))
where mask_convs denotes 4 convolutions of 3×3, cat denotes the concatenation operation, coord_features denotes the normalized coordinate information, P denotes the enhanced feature map, and mask_features denotes the feature map obtained after the 4 convolutions of 3×3.
It should be noted that the normalized coordinate information of this embodiment comes from the target boxes output by the instance branch.
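A CoordConv-style sketch of this step is shown below; it appends two channels holding x/y coordinates normalized to [-1, 1] and then applies the four 3×3 convolutions. Deriving the coordinate channels specifically from the predicted target boxes (rather than from the full feature-map extent) is left as an assumption.

    import torch
    import torch.nn as nn

    def coord_features(feature_map):
        # Two extra channels with x/y positions normalized to [-1, 1].
        n, _, h, w = feature_map.shape
        ys = torch.linspace(-1, 1, h, device=feature_map.device)
        xs = torch.linspace(-1, 1, w, device=feature_map.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        return torch.stack((gx, gy), dim=0).expand(n, -1, -1, -1)   # (N, 2, H, W)

    class MaskStem(nn.Module):
        # Four 3x3 convolutions applied after concatenating the coordinate channels,
        # mirroring mask_features = mask_convs(cat(coord_features, P)).
        def __init__(self, in_channels, width=256):
            super().__init__()
            layers, c = [], in_channels + 2
            for _ in range(4):
                layers += [nn.Conv2d(c, width, 3, padding=1), nn.ReLU(inplace=True)]
                c = width
            self.convs = nn.Sequential(*layers)

        def forward(self, p):
            return self.convs(torch.cat((coord_features(p), p), dim=1))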
The obtained mask_features feature map is then input into a SAM module (spatial attention module) and a mask kernel generation module (mask kernel), respectively.
As shown in fig. 5, the SAM module takes the input feature map mask_features ∈ R^(C×H×W), performs maximum pooling (Max Pool) and average pooling (Avg Pool) along the channel dimension to obtain pooled features P_max, P_avg ∈ R^(1×H×W), takes the dot product of P_max and P_avg, applies a 3×3 convolution (recovering the channel), then a sigmoid operation, and finally multiplies the result element-wise with the original input to obtain the spatial attention feature.
Specifically, the following formula is adopted to represent the corresponding operation of the SAM module:
A_sam(mask_features) = σ(conv_3×3(P_max ⊙ P_avg))

where A_sam(mask_features) denotes the attention map obtained after the sigmoid, σ denotes the sigmoid function, and ⊙ denotes the dot product;

spatial_attention_features_sam = A_sam(mask_features) ⊗ mask_features

where spatial_attention_features_sam denotes the spatial attention feature finally output by the SAM module, and ⊗ denotes element-wise multiplication.
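The SAM operation above can be written compactly in PyTorch as follows (a minimal sketch of the described module):

    import torch
    import torch.nn as nn

    class SpatialAttentionModule(nn.Module):
        # Channel-wise max and average pooling, dot product of the two pooled maps,
        # 3x3 convolution, sigmoid, and element-wise multiplication with the input.
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)

        def forward(self, mask_features):
            p_max, _ = mask_features.max(dim=1, keepdim=True)    # (N, 1, H, W)
            p_avg = mask_features.mean(dim=1, keepdim=True)       # (N, 1, H, W)
            attn = torch.sigmoid(self.conv(p_max * p_avg))        # A_sam in the formula above
            return mask_features * attn                           # spatial attention features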
In the mask kernel module of this embodiment, the dimension of the input feature map mask_features is adjusted through a linear layer to generate a 128-dimensional mask kernel, formulated as follows:
w = Linear_256->128(mask_features)
where w ∈ R^(1×C) denotes the mask kernel.
Finally, the spatial attention feature is multiplied by the mask kernel to obtain the segmentation mask output by the mask branch, expressed by the following formula:
pred_mask = w ⊗ spatial_attention_features_sam
where pred_mask is the segmentation mask output by the mask branch, w ∈ R^(1×C) denotes the mask kernel, and spatial_attention_features_sam ∈ R^(C×H×W) is the spatial attention feature.
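A sketch of the mask kernel generation and the final multiplication is given below for a single kernel. How mask_features is reduced to a 256-dimensional vector before the linear layer, the use of a sigmoid on the output, and the assumption that the spatial attention features carry 128 channels (so that the product with the 128-dimensional kernel is well defined) are illustrative choices, not details fixed by the text.

    import torch
    import torch.nn as nn

    kernel_gen = nn.Linear(256, 128)   # the Linear 256 -> 128 layer from the formula above

    def predict_mask(mask_features, spatial_attention_features):
        # mask_features: (256, H, W); spatial_attention_features: (128, H, W)
        vec = mask_features.mean(dim=(1, 2))                   # global average pooling (assumption)
        w = kernel_gen(vec)                                    # (128,) mask kernel
        c, h, wd = spatial_attention_features.shape
        logits = (w @ spatial_attention_features.view(c, h * wd)).view(h, wd)
        return torch.sigmoid(logits)                           # predicted segmentation mask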
Step S5: mapping the segmentation mask, the target boxes and the target classifications onto the image to be processed to obtain the instance segmentation result.
Through the above steps, the segmentation mask, the target boxes and the target classifications have been obtained. Finally, the segmentation mask only needs to be upsampled to the resolution of the original image to be processed, after which the segmentation mask, the target boxes and the target classifications can be mapped onto the image to be processed, realizing instance segmentation.
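A minimal sketch of this final mapping step (bilinear upsampling of the mask probabilities to the original resolution followed by thresholding; the threshold value is illustrative):

    import torch.nn.functional as F

    def map_to_image(pred_masks, image_size, threshold=0.5):
        # pred_masks: (N, H, W) mask probabilities; image_size: (H_img, W_img).
        up = F.interpolate(pred_masks.unsqueeze(1), size=image_size,
                           mode="bilinear", align_corners=False).squeeze(1)
        return up > threshold    # boolean instance masks at the original image resolution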
In one specific embodiment, since one-to-one assignment forces most predictions to be background, which reduces the confidence of the class, this embodiment introduces the intersection-over-union IoU to adjust the target classification probability when training the instance segmentation network. That is, the enhanced feature map is input into the instance branch to obtain target boxes and target classifications, where the target classification probability corresponding to target box i is P_i, and IoU is introduced to adjust the target classification probability according to the following formula:
$$\hat{P}_i = \sqrt{S_i \cdot P_i}$$

where $\hat{P}_i$ is the adjusted target classification probability, $S_i$ denotes the intersection-over-union IoU, and $P_i$ denotes the target classification probability corresponding to target box i. $S_i$ is calculated as:

$$S_i = \frac{\sum_{x,y} a_{x,y}\, b_{x,y}}{\sum_{x,y} \big(a_{x,y} + b_{x,y} - a_{x,y}\, b_{x,y}\big)}$$

where $a_{x,y}$ and $b_{x,y}$ denote the pixels of the predicted target box a and the real target box b at (x, y), respectively.
In a specific embodiment, when training the instance segmentation network, the method also computes the overall loss function of the network, performs back-propagation, and updates the network parameters.
The overall network loss is composed of the binary cross-entropy loss for target classification $\mathcal{L}_{cls}$, the mask loss $\mathcal{L}_{mask}$ and the target box loss $\mathcal{L}_{box}$, and can be expressed as:

$$\mathcal{L} = \lambda_{cls}\,\mathcal{L}_{cls} + \mathcal{L}_{mask} + \mathcal{L}_{box}$$

where $\mathcal{L}_{cls}$ denotes the target classification loss, $\mathcal{L}_{mask}$ denotes the mask loss, $\mathcal{L}_{box}$ denotes the target box loss, and $\lambda_{cls}$ is a weight coefficient.
To enable end-to-end training, label assignment is formulated as bipartite graph matching. First, a pairwise dice-based matching score between the i-th prediction and the k-th ground-truth object is defined; it is determined by the classification score and the dice coefficient of the segmentation mask:

$$C(i,k) = p_{i,c_k}^{\,1-\alpha} \cdot \mathrm{DICE}(m_i, t_k)^{\,\alpha}$$

where $\alpha$ is a hyper-parameter set to 0.8 that balances the influence of classification and segmentation, $p_{i,c_k}$ is the probability that the class predicted by the i-th instance is the class of the k-th ground-truth object, and $m_i$ and $t_k$ are the mask of the i-th predicted object and the mask of the k-th ground-truth object, respectively. The DICE coefficient is defined as:

$$\mathrm{DICE}(m_i, t_k) = \frac{2\sum_{x,y} m^{i}_{x,y}\, t^{k}_{x,y}}{\sum_{x,y} \big(m^{i}_{x,y}\big)^2 + \sum_{x,y} \big(t^{k}_{x,y}\big)^2}$$

where $m^{i}_{x,y}$ and $t^{k}_{x,y}$ denote the pixels at (x, y) of the i-th predicted mask and the k-th ground-truth object mask, respectively. The Hungarian algorithm is then used to find the best match between the K ground-truth objects and the N predictions.
The mask loss function is obtained by combining the dice loss and the pixel-level binary cross-entropy loss:

$$\mathcal{L}_{mask} = \lambda_{dice}\,\mathcal{L}_{dice} + \lambda_{pix}\,\mathcal{L}_{pix}$$

where $\mathcal{L}_{dice}$ and $\mathcal{L}_{pix}$ are the dice loss and the pixel-level binary cross-entropy loss, $\lambda_{dice}$ and $\lambda_{pix}$ are the corresponding weight coefficients, and the dice loss is computed from the DICE coefficient defined above.
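A corresponding sketch of the mask loss is given below; the λ values are placeholders, and the use of 1 minus the DICE coefficient for the dice loss is an assumed (but standard) form:

    import torch
    import torch.nn.functional as F

    def dice_loss(pred, target, eps=1e-6):
        # Assumed form: 1 - DICE, with the DICE coefficient defined above.
        num = 2.0 * (pred * target).sum(dim=(-2, -1))
        den = (pred * pred).sum(dim=(-2, -1)) + (target * target).sum(dim=(-2, -1)) + eps
        return 1.0 - num / den

    def mask_loss(pred_logits, target, lambda_dice=2.0, lambda_pix=2.0):
        # L_mask = lambda_dice * L_dice + lambda_pix * L_pix
        pred = torch.sigmoid(pred_logits)
        l_dice = dice_loss(pred, target).mean()
        l_pix = F.binary_cross_entropy_with_logits(pred_logits, target)
        return lambda_dice * l_dice + lambda_pix * l_pix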
the above-mentioned embodiments only express several embodiments of the present application, and the description thereof is specific and detailed, but not to be understood as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (7)

1. A real-time instance segmentation method fusing a sparse framework and spatial attention, characterized in that instance segmentation is performed on an image with a constructed instance segmentation network comprising a feature extraction network, a feature enhancement network, a mask branch and an instance branch, the method comprising the following steps:
acquiring an image to be processed and inputting it into the feature extraction network to extract a multi-scale feature map;
inputting the multi-scale feature map into the feature enhancement network to obtain an enhanced feature map;
inputting the enhanced feature map into the instance branch to obtain target boxes and target classifications;
concatenating the enhanced feature map with the target boxes output by the instance branch and inputting the result into the mask branch, first performing a convolution operation, then passing the result through a spatial attention module and a mask kernel generation module to obtain a spatial attention feature and a mask kernel respectively, and multiplying the two to obtain a segmentation mask;
and mapping the segmentation mask, the target boxes and the target classifications onto the image to be processed to obtain the instance segmentation result.
2. The real-time instance segmentation method fusing a sparse framework and spatial attention according to claim 1, characterized in that the feature extraction network adopts ResNet50, using the outputs of residual modules 3, 4 and 5 in ResNet50 as the extracted third-scale feature map, second-scale feature map and first-scale feature map;
inputting the multi-scale feature map into the feature enhancement network to obtain an enhanced feature map comprises:
inputting the first-scale feature map into a pyramid pooling module and outputting a first feature map;
adding the second-scale feature map and the first feature map element-wise to obtain a second feature map;
adding the third-scale feature map and the second feature map element-wise to obtain a third feature map;
and performing a convolution operation on the first, second and third feature maps respectively and then concatenating them to obtain the enhanced feature map.
3. The real-time instance segmentation method fusing a sparse framework and spatial attention according to claim 1, characterized in that inputting the enhanced feature map into the instance branch to obtain target boxes and target classifications comprises:
initializing proposal boxes and proposal features;
extracting features of interest from the enhanced feature map by region feature focusing;
and inputting the features of interest and the proposal features into a dynamic convolution head to generate the final target boxes and target classifications.
4. The real-time instance segmentation method fusing a sparse framework and spatial attention according to claim 1, characterized in that the spatial attention module performs the following operations:
performing maximum pooling and average pooling on the input feature map along the channel dimension to obtain pooled features P_max and P_avg, taking the dot product of P_max and P_avg, applying a 3×3 convolution followed by a sigmoid operation, and multiplying the result with the original input feature map to obtain the spatial attention feature.
5. The real-time instance segmentation method fusing a sparse framework and spatial attention according to claim 1, characterized in that the mask kernel generation module performs the following operations:
dimension is adjusted through linear layers on the input feature graph, and a 128-dimensional mask kernel is generated.
6. The real-time instance segmentation method fusing a sparse framework and spatial attention according to claim 1, characterized in that, when training the instance segmentation network, inputting the enhanced feature map into the instance branch to obtain target boxes and target classifications further comprises:
introducing the intersection-over-union IoU to adjust the target classification probability according to the following formula:
$$\hat{P}_i = \sqrt{S_i \cdot P_i}$$

where $\hat{P}_i$ is the adjusted target classification probability, $S_i$ denotes the intersection-over-union IoU, and $P_i$ denotes the target classification probability corresponding to target box i.
7. The real-time instance segmentation method fusing a sparse framework and spatial attention according to claim 1, characterized in that, when training the instance segmentation network, the overall loss function of the network is as follows:

$$\mathcal{L} = \lambda_{cls}\,\mathcal{L}_{cls} + \mathcal{L}_{mask} + \mathcal{L}_{box}$$

where $\mathcal{L}_{cls}$ denotes the target classification loss, $\mathcal{L}_{mask}$ denotes the mask loss, $\mathcal{L}_{box}$ denotes the target box loss, and $\lambda_{cls}$ is a weight coefficient;

$$\mathcal{L}_{mask} = \lambda_{dice}\,\mathcal{L}_{dice} + \lambda_{pix}\,\mathcal{L}_{pix}$$

where $\mathcal{L}_{dice}$ and $\mathcal{L}_{pix}$ are the dice loss and the pixel-level binary cross-entropy loss, and $\lambda_{dice}$ and $\lambda_{pix}$ are the corresponding weight coefficients.
CN202210803057.XA 2022-07-07 2022-07-07 Real-time instance segmentation method integrating sparse framework and spatial attention Pending CN115100410A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210803057.XA CN115100410A (en) 2022-07-07 2022-07-07 Real-time instance segmentation method integrating sparse framework and spatial attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210803057.XA CN115100410A (en) 2022-07-07 2022-07-07 Real-time instance segmentation method integrating sparse framework and spatial attention

Publications (1)

Publication Number Publication Date
CN115100410A true CN115100410A (en) 2022-09-23

Family

ID=83296081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210803057.XA Pending CN115100410A (en) 2022-07-07 2022-07-07 Real-time instance segmentation method integrating sparse framework and spatial attention

Country Status (1)

Country Link
CN (1) CN115100410A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117593530A (en) * 2024-01-19 2024-02-23 杭州灵西机器人智能科技有限公司 Dense carton segmentation method and system


Similar Documents

Publication Publication Date Title
US10943145B2 (en) Image processing methods and apparatus, and electronic devices
CN110097568B (en) Video object detection and segmentation method based on space-time dual-branch network
CN110516536B (en) Weak supervision video behavior detection method based on time sequence class activation graph complementation
CN113158723B (en) End-to-end video motion detection positioning system
Deng et al. MLOD: A multi-view 3D object detection based on robust feature fusion method
CN111079739A (en) Multi-scale attention feature detection method
CN111898432A (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN111401293A (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN114187311A (en) Image semantic segmentation method, device, equipment and storage medium
Ma et al. Fusioncount: Efficient crowd counting via multiscale feature fusion
Amirian et al. Dissection of deep learning with applications in image recognition
CN115424017B (en) Building inner and outer contour segmentation method, device and storage medium
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
US20230154139A1 (en) Systems and methods for contrastive pretraining with video tracking supervision
CN116129291A (en) Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device
Harkat et al. Fire detection using residual deeplabv3+ model
Yuan et al. A lightweight network for smoke semantic segmentation
CN115100410A (en) Real-time instance segmentation method integrating sparse framework and spatial attention
CN114996495A (en) Single-sample image segmentation method and device based on multiple prototypes and iterative enhancement
Aldhaheri et al. MACC Net: Multi-task attention crowd counting network
Yuan et al. A cross-scale mixed attention network for smoke segmentation
Quiroga et al. A study of convolutional architectures for handshape recognition applied to sign language
Cao et al. QuasiVSD: efficient dual-frame smoke detection
Kurama et al. Image semantic segmentation using deep learning
CN117315752A (en) Training method, device, equipment and medium for face emotion recognition network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination