CN115100410A - Real-time instance segmentation method integrating sparse framework and spatial attention - Google Patents

Real-time instance segmentation method integrating sparse framework and spatial attention Download PDF

Info

Publication number
CN115100410A
Authority
CN
China
Prior art keywords
mask
feature
feature map
target
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210803057.XA
Other languages
Chinese (zh)
Inventor
刘盛
陈俊皓
张峰
郭炳男
陈瑞祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202210803057.XA priority Critical patent/CN115100410A/en
Publication of CN115100410A publication Critical patent/CN115100410A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a real-time instance segmentation method fusing a sparse framework and spatial attention. The method first acquires an image to be processed, inputs it into a feature extraction network to extract a multi-scale feature map, and inputs the multi-scale feature map into a feature enhancement network to obtain an enhanced feature map. The enhanced feature map is input into an instance branch to obtain target boxes and target classifications; the enhanced feature map is then concatenated with the target boxes output by the instance branch and input into a mask branch, where a convolution operation is performed, followed by a spatial attention module and a mask kernel generation module that respectively produce a spatial attention feature and a mask kernel, which are multiplied to obtain a segmentation mask. Finally, the segmentation mask, the target boxes and the target classifications are mapped onto the image to be processed to obtain the instance segmentation result. The invention improves both the speed and the accuracy of the instance segmentation task, and can produce real-time, accurate segmentation results when continuous video frames are input.

Description

Real-time instance segmentation method integrating sparse framework and spatial attention
Technical Field
The application belongs to the technical field of instance segmentation, and particularly relates to a real-time instance segmentation method fusing a sparse framework and spatial attention.
Background
Instance segmentation has become one of the most important, complex and challenging areas in machine vision research. It underpins many downstream tasks, including crowd detection, autonomous driving and video surveillance.
The purpose of object detection is to detect the category of a target in an image and give its position in the form of a bounding box or a center point; semantic segmentation aims to predict the label of every pixel in the image to obtain an accurate inference result, classifying and labelling each pixel according to the object or region it belongs to. Instance segmentation segments the objects in the image (generates a mask) and assigns different labels to different object instances of the same object class. Instance segmentation can therefore be regarded as solving object detection and semantic segmentation simultaneously, decomposing each segmented object into its respective sub-components.
One classic and effective instance segmentation idea in the prior art is based on proposal-box generation: a proposal box is used to detect an object, and segmentation is then performed inside the proposal box. Three common types of proposal-box detectors are dense detectors, dense-to-sparse detectors and sparse detectors, and common instance segmentation models are Mask RCNN, PolarMask and Sparse RCNN.
Mask RCNN is a typical top-down detection method that adds a new branch at the end of the dense-to-sparse detector Faster RCNN and segments the features after ROI Align (a feature alignment method). It follows the detect-then-segment paradigm: the model generates dense prior target boxes with an RPN, and a mask segmentation head then separates foreground from background inside each detected target box. This approach effectively eliminates the influence of the background on segmentation, but it is sensitive to detector performance and requires the detection step to be good enough.
PolarMask improves on the dense detector FCOS and unifies instance segmentation into a fully convolutional form, i.e. it does not use the non-convolutional ROI Align operation; PolarMask models a mask not as a binary image but as an image composed of several rays in a polar coordinate system, thereby realizing a single-stage instance segmentation framework.
The greatest characteristic of Sparse RCNN is the sparsity of the whole target detection pipeline: its input is a set of sparse proposal boxes and proposal features with one-to-one interaction. The whole detection process involves no dense proposal regions and no dense (global) features; it avoids designing a large number of prior boxes and the many-to-one mapping between prior boxes and ground-truth boxes, performs better than dense proposal-box generation algorithms, requires no post-processing such as non-maximum suppression, and realizes a fully end-to-end object detector.
Most previous object detectors are dense detectors, such as Mask RCNN: they are built on dense proposals preset on the image grid or feature-map grid, predict scores and offsets for these proposals, judge them by IoU (intersection over union) and filter them with NMS (non-maximum suppression). A smaller number are dense-to-sparse detectors, such as PolarMask, which first extract a few (sparse) foreground boxes, i.e. region candidate boxes, from the dense proposal regions, and then classify each candidate box and regress its position, narrowing thousands of candidates down to a few foreground objects. Both kinds of methods are time-consuming, because each candidate box is passed through the convolutional neural network individually and there are tedious prior-box design and post-processing operations.
Disclosure of Invention
The purpose of the application is to provide a real-time instance segmentation method fusing a sparse framework and spatial attention that avoids prior-box design and a large number of complex post-processing operations such as non-maximum suppression, and achieves fast inference.
In order to achieve this purpose, the technical solution of the application is as follows:
a real-time instance segmentation method fusing a sparse framework and spatial attention performs instance segmentation on an image with a constructed instance segmentation network comprising a feature extraction network, a feature enhancement network, a mask branch and an instance branch, and comprises the following steps:
acquiring an image to be processed and inputting it into the feature extraction network to extract a multi-scale feature map;
inputting the multi-scale feature map into the feature enhancement network to obtain an enhanced feature map;
inputting the enhanced feature map into the instance branch to obtain target boxes and target classifications;
concatenating the enhanced feature map with the target boxes output by the instance branch and inputting the result into the mask branch, first performing a convolution operation, then passing the result through a spatial attention module and a mask kernel generation module to obtain a spatial attention feature and a mask kernel respectively, and multiplying the two to obtain a segmentation mask;
and mapping the segmentation mask, the target boxes and the target classifications onto the image to be processed to obtain the instance segmentation result.
Further, the feature extraction network adopts ResNet50, taking the outputs of residual modules 3, 4 and 5 in ResNet50 as the extracted third-scale feature map, second-scale feature map and first-scale feature map respectively;
inputting the multi-scale feature map into the feature enhancement network to obtain an enhanced feature map comprises:
inputting the first-scale feature map into a pyramid pooling module and outputting a first feature map;
adding the second-scale feature map and the first feature map element-wise to obtain a second feature map;
adding the third-scale feature map and the second feature map element-wise to obtain a third feature map;
and performing a convolution operation on the first, second and third feature maps respectively and then concatenating them to obtain the enhanced feature map.
Further, inputting the enhanced feature map into the instance branch to obtain target boxes and target classifications comprises:
initializing proposal boxes and proposal features;
extracting features of interest from the enhanced feature map by region feature focusing;
and inputting the features of interest and the proposal features into a dynamic convolution head to generate the final target boxes and target classifications.
Further, the spatial attention module performs the following operations:
performing maximum pooling and average pooling on the input feature map along the channel dimension to obtain pooled features P_max and P_avg, taking the dot product of P_max and P_avg, applying a 3×3 convolution followed by a sigmoid operation, and multiplying the result with the original input feature map to obtain the spatial attention feature.
Further, the mask kernel generation module performs the following operations:
dimension is adjusted through linear layers on the input feature graph, and a 128-dimensional mask kernel is generated.
Further, when training the instance segmentation network, inputting the enhanced feature map into the instance branch to obtain target boxes and target classifications further includes:
and introducing an intersection ratio IoU to adjust the target classification probability, wherein the formula is as follows:
Figure BDA0003735040430000041
wherein the content of the first and second substances,
Figure BDA0003735040430000042
for the adjusted target classification probability, S i Denotes the intersection ratio IoU, P i And representing the classification probability of the target corresponding to the target box i.
Further, when training the instance segmentation network, the overall loss function of the network is as follows:
$$\mathcal{L} = \lambda_{cls}\,\mathcal{L}_{cls} + \mathcal{L}_{mask} + \mathcal{L}_{box}$$

where $\mathcal{L}_{cls}$ denotes the target classification loss, $\mathcal{L}_{mask}$ denotes the mask loss, $\mathcal{L}_{box}$ denotes the target box loss, and $\lambda_{cls}$ is a weight coefficient;

$$\mathcal{L}_{mask} = \lambda_{dice}\,\mathcal{L}_{dice} + \lambda_{pix}\,\mathcal{L}_{pix}$$

where $\mathcal{L}_{dice}$ and $\mathcal{L}_{pix}$ are the dice loss and the pixel-level binary cross-entropy loss, and $\lambda_{dice}$ and $\lambda_{pix}$ are the corresponding weight coefficients.
The application improves the existing feature extraction and enhancement methods and the efficiency of that stage. Based on a purely sparse target detection algorithm, it avoids the design of a large number of prior boxes and the many-to-one mapping between prior boxes and ground-truth boxes, needs no complex post-processing such as non-maximum suppression, and improves the network's attention to instance features by fusing a spatial attention module. The method improves both the speed and the accuracy of the instance segmentation task and can produce real-time, accurate segmentation results when continuous video frames are input.
Drawings
FIG. 1 is a flow chart of an example segmentation method of the present application;
FIG. 2 is a schematic diagram of the instance segmentation network architecture of the present application;
FIG. 3 is a schematic diagram of a pyramid pooling module;
FIG. 4 is a schematic diagram of a dynamic convolution head;
fig. 5 is a schematic view of a spatial attention module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a real-time instance segmentation method fusing a sparse framework and spatial attention is provided. An image is segmented into instances using a constructed instance segmentation network comprising a feature extraction network, a feature enhancement network, a mask branch and an instance branch. The real-time instance segmentation method fusing a sparse framework and spatial attention comprises the following steps.
Step S1: acquiring the image to be processed, inputting it into the feature extraction network, and extracting a multi-scale feature map.
In this embodiment, an image to be processed, Image ∈ R^(3×H×W), is given, where 3 is the number of RGB channels and H and W denote the height and width of the image; the image is first preprocessed.
The preprocessing differs between network model training and practical application. During training, to improve the generalization of the model, the pictures are augmented: each picture is first flipped left-right with a probability of 0.5, which doubles the amount of training data, and is then randomly cropped and scaled to a fixed size. Because the input pixel values lie in the range 0 to 255 and training is unstable in this range, the pixel values are scaled proportionally to the range 0 to 1, so the preprocessing further normalizes the pixel values of the image.
In practical application after the network model has been trained, no data augmentation is applied to the pictures; the inference input only needs to be consistent with that used during training. Specifically, the image to be processed only needs to be scaled and normalized (no random flipping or cropping).
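For illustration, a minimal preprocessing sketch in PyTorch/torchvision is given below; the crop size, input resolution, flip probability and normalization statistics are illustrative assumptions rather than values fixed by this embodiment:

    import torchvision.transforms as T

    # Training-time augmentation: random horizontal flip, random crop, resize, normalize.
    train_transform = T.Compose([
        T.RandomHorizontalFlip(p=0.5),          # doubles the effective amount of training data
        T.RandomCrop(512, pad_if_needed=True),  # random crop (illustrative size)
        T.Resize((640, 640)),                   # scale to a fixed input size
        T.ToTensor(),                           # maps pixel values from [0, 255] to [0, 1]
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    # Inference: only scaling and normalization, no random flipping or cropping.
    test_transform = T.Compose([
        T.Resize((640, 640)),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])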
This embodiment adopts ResNet50 as the feature extraction network (also referred to as the backbone) and inputs the preprocessed pictures into ResNet50 to extract image features. The feature extraction network outputs multi-scale feature maps (res3, res4 and res5), i.e. a series of feature maps of different sizes F_i ∈ R^(C_i×H_i×W_i), where the channel numbers C_i are (512, 1024, 2048) in order, the heights H_i correspond to 1/8, 1/16 and 1/32 of the original image height, and the widths W_i are scaled in the same proportion as the heights. The feature maps obtained here are taken as the input of the next part.
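As a concrete illustration, the multi-scale features can be taken from a standard ResNet50 as sketched below (assuming torchvision's ResNet50, where res3, res4 and res5 correspond to the outputs of layer2, layer3 and layer4 with 512, 1024 and 2048 channels at strides 8, 16 and 32):

    import torch
    import torchvision

    backbone = torchvision.models.resnet50()   # pretrained weights may be loaded if desired

    def extract_features(x):
        x = backbone.conv1(x)
        x = backbone.bn1(x)
        x = backbone.relu(x)
        x = backbone.maxpool(x)        # stride 4
        c2 = backbone.layer1(x)        # stride 4,  256 channels
        res3 = backbone.layer2(c2)     # stride 8,  512 channels  (third-scale feature map)
        res4 = backbone.layer3(res3)   # stride 16, 1024 channels (second-scale feature map)
        res5 = backbone.layer4(res4)   # stride 32, 2048 channels (first-scale feature map)
        return res3, res4, res5

    res3, res4, res5 = extract_features(torch.randn(1, 3, 640, 640))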
Step S2: inputting the multi-scale feature map into the feature enhancement network to obtain an enhanced feature map.
In this step, the multi-scale feature map is passed into the feature enhancement network, also called the neck network, to obtain an enhanced feature map.
In a specific embodiment, the feature enhancement network performs the following operations:
inputting the first-scale feature map into a pyramid pooling module and outputting a first feature map;
adding the second-scale feature map and the first feature map element-wise to obtain a second feature map;
adding the third-scale feature map and the second feature map element-wise to obtain a third feature map;
and performing a convolution operation on the first, second and third feature maps respectively and then concatenating them to obtain the enhanced feature map.
As shown in fig. 2, the input image to be processed passes through ResNet50, and the outputs res3, res4 and res5 of residual modules 3, 4 and 5 of ResNet50 are referred to as the third-scale feature map, the second-scale feature map and the first-scale feature map respectively; they are then input into the feature enhancement network (neck). In the feature enhancement network, res5 is input into a pyramid pooling module (PPM) to obtain the first feature map, res4 is added to the first feature map to obtain the second feature map, and res3 is added to the second feature map to obtain the third feature map; the first, second and third feature maps are then each passed through a convolution operation and concatenated to obtain the enhanced feature map.
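A minimal PyTorch-style sketch of this feature enhancement network is given below. The 1×1 lateral convolutions, the common channel width of 256 and the bilinear upsampling used to make the element-wise additions and the final concatenation shape-compatible are implementation assumptions that the text does not spell out; PyramidPoolingModule refers to the sketch after the pyramid pooling description below.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureEnhancementNeck(nn.Module):
        def __init__(self, in_channels=(512, 1024, 2048), width=256):
            super().__init__()
            self.lat3 = nn.Conv2d(in_channels[0], width, 1)   # res3 -> common width
            self.lat4 = nn.Conv2d(in_channels[1], width, 1)   # res4 -> common width
            self.lat5 = nn.Conv2d(in_channels[2], width, 1)   # res5 -> common width
            self.ppm = PyramidPoolingModule(width)            # see the PPM sketch below
            self.out_convs = nn.ModuleList(nn.Conv2d(width, width, 3, padding=1) for _ in range(3))

        def forward(self, res3, res4, res5):
            f1 = self.ppm(self.lat5(res5))                                        # first feature map
            f2 = self.lat4(res4) + F.interpolate(f1, size=res4.shape[-2:],
                                                 mode="bilinear", align_corners=False)
            f3 = self.lat3(res3) + F.interpolate(f2, size=res3.shape[-2:],
                                                 mode="bilinear", align_corners=False)
            outs = [conv(f) for conv, f in zip(self.out_convs, (f1, f2, f3))]     # per-map convolution
            outs = [F.interpolate(f, size=res3.shape[-2:],
                                  mode="bilinear", align_corners=False) for f in outs]
            return torch.cat(outs, dim=1)                                         # enhanced feature map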
Inputting res5 into the pyramid pooling module (PPM) to obtain the first feature map enlarges the receptive field. As shown in fig. 3, the input features of the pyramid pooling module pass through 4 pooling levels (pool) whose output sizes are 1×1, 2×2, 3×3 and 6×6 respectively. The feature map is pooled to each target size, and a 1×1 convolution (conv) is applied to each pooled result to reduce its channels to 1/N of the original, where N = 4. Each of these feature maps is then upsampled by bilinear interpolation (upsample) to the size of the original feature map, and the original feature map and the upsampled feature maps are concatenated along the channel dimension (concat). The resulting number of channels is twice that of the original feature map, and a final 1×1 convolution reduces the channels back to the original number, so that the final feature map has the same size and number of channels as the original feature map.
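The pyramid pooling module described above can be sketched as follows (a minimal PyTorch version; average pooling is assumed for the pooling levels):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PyramidPoolingModule(nn.Module):
        # Pool to 1x1, 2x2, 3x3 and 6x6, reduce each branch to C/4 channels with a 1x1 conv,
        # upsample back, concatenate with the input (2C channels) and project back to C channels.
        def __init__(self, channels, bins=(1, 2, 3, 6)):
            super().__init__()
            self.bins = bins
            self.branches = nn.ModuleList(
                nn.Conv2d(channels, channels // len(bins), 1) for _ in bins
            )
            self.project = nn.Conv2d(channels * 2, channels, 1)

        def forward(self, x):
            h, w = x.shape[-2:]
            outs = [x]
            for bin_size, conv in zip(self.bins, self.branches):
                y = F.adaptive_avg_pool2d(x, bin_size)      # pool to bin_size x bin_size
                y = conv(y)                                  # reduce channels to C/4
                y = F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False)
                outs.append(y)
            return self.project(torch.cat(outs, dim=1))      # back to C channels, same size as input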
Step S3: inputting the enhanced feature map into the instance branch to obtain target boxes and target classifications.
The instance branch performs the following specific operations.
Step S31: initializing proposal boxes and proposal features.
Before the enhanced feature map is input into the instance branch, a set of learnable parameters of size N×4 representing the proposal boxes init_bboxes is initialized, where N is a hyper-parameter denoting the number of initial proposal boxes, and a set of learnable proposal features of size N×d corresponding to the proposal boxes is initialized. That is, N proposal boxes and N corresponding proposal features are obtained by initialization.
Step S32: extracting features of interest from the enhanced feature map by region feature focusing.
This embodiment extracts the features of interest roi_features from the enhanced feature map P by region feature focusing (ROI Align). Since N proposal boxes are initialized, N features of interest are obtained. ROI Align is a well-established technique in the art and is not described in detail here.
Step S33: inputting the features of interest and the proposal features into a dynamic convolution head to generate the final target boxes and target classifications.
In this embodiment, as shown in fig. 4, a dynamic instance interactive head executes k dynamic convolution layers sequentially, where k is a hyper-parameter, i.e. the preset number of dynamic convolution iterations.
The inputs of the dynamic convolution head are the features of interest roi_features and the proposal features proposal_features. This can be regarded as an attention mechanism: the proposal features generate the parameters of a convolution kernel, which then acts on the features of interest to obtain the final prediction result.
The output features and output proposal boxes of each dynamic convolution layer serve as the proposal features and proposal boxes of the next layer, and the final target boxes and target classifications are generated through multiple iterations.
It should be noted that, in this embodiment, each feature of interest and its corresponding proposal feature are passed through the dynamic convolution in turn, so that all the target boxes and target classifications are obtained.
In this embodiment, a set of learnable parameters of size N×4 is initialized to represent the proposal boxes; the learned proposal boxes can be understood as statistics of the positions where objects are likely to appear in an image. A set of learnable proposal features of size N×d, corresponding one-to-one to the proposal boxes, is initialized to characterize the target features. ROI Align provides the features of interest; it avoids losing information from the original feature map, and the absence of quantization throughout the intermediate process guarantees maximal information integrity. The obtained features are then sent to the dynamic convolution head, whose inputs are the features of interest and the proposal features. Each feature of interest is associated with one proposal feature to obtain the final output feature. The proposal feature can be viewed as an attention mechanism: it generates the parameters (params) of a convolution kernel, which is then applied to the feature of interest to obtain the final prediction result. The output features and output proposal boxes of the previous dynamic convolution serve as the proposal features and proposal boxes of the next dynamic convolution, and the final target boxes and target classifications are generated through multiple iterations. Dynamic convolution heads are well established in the art and are not described in detail here.
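A simplified sketch of one dynamic-interaction layer is given below; the feature dimension d = 256, the ROI size and the single d×d dynamic kernel per proposal are assumptions chosen to illustrate how a proposal feature generates convolution parameters that act on its region-of-interest feature.

    import torch
    import torch.nn as nn

    class DynamicConvLayer(nn.Module):
        def __init__(self, d=256, num_classes=80):
            super().__init__()
            self.param_gen = nn.Linear(d, d * d)        # proposal feature -> dynamic 1x1 kernel
            self.norm = nn.LayerNorm(d)
            self.cls_head = nn.Linear(d, num_classes)   # target classification
            self.reg_head = nn.Linear(d, 4)             # target box refinement (deltas)

        def forward(self, roi_features, proposal_features):
            # roi_features: (N, d, S, S) from ROI Align; proposal_features: (N, d)
            n, d, s, _ = roi_features.shape
            roi = roi_features.flatten(2).permute(0, 2, 1)              # (N, S*S, d)
            kernels = self.param_gen(proposal_features).view(n, d, d)   # one d x d kernel per proposal
            x = torch.relu(self.norm(torch.bmm(roi, kernels)))          # dynamic 1x1 conv as batched matmul
            x = x.mean(dim=1)                                           # (N, d) pooled object feature
            return self.cls_head(x), self.reg_head(x), x                # cls, box deltas, next proposal feature

Stacking k such layers, with the refined boxes and output features fed to the next layer, yields the iterative behaviour described above.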
Step S4: concatenating the enhanced feature map with the target boxes output by the instance branch and inputting the result into the mask branch, first performing a convolution operation, then passing the result through a spatial attention module and a mask kernel generation module respectively to obtain a spatial attention feature and a mask kernel, and multiplying the two to obtain a segmentation mask.
In this embodiment, the enhanced feature map is concatenated with the target boxes output by the instance branch and input into the mask branch, where a convolution operation is performed first. That is, before the enhanced feature map is input into the mask branch, the spatiality of the feature map is increased; the normalized coordinate information added in the first convolution can be expressed by the following formula:
mask_features = mask_convs(cat(coord_features, P))
where mask_convs denotes 4 convolutions of 3×3, cat denotes the concatenation operation, coord_features denotes the normalized coordinate information, P denotes the enhanced feature map, and mask_features denotes the feature map obtained after the 4 convolutions of 3×3.
It should be noted that the normalized coordinate information of this embodiment comes from the target boxes output by the instance branch.
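A CoordConv-style sketch of this step is shown below; it appends two channels holding x/y coordinates normalized to [-1, 1] and then applies the four 3×3 convolutions. Deriving the coordinate channels specifically from the predicted target boxes (rather than from the full feature-map extent) is left as an assumption.

    import torch
    import torch.nn as nn

    def coord_features(feature_map):
        # Two extra channels with x/y positions normalized to [-1, 1].
        n, _, h, w = feature_map.shape
        ys = torch.linspace(-1, 1, h, device=feature_map.device)
        xs = torch.linspace(-1, 1, w, device=feature_map.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        return torch.stack((gx, gy), dim=0).expand(n, -1, -1, -1)   # (N, 2, H, W)

    class MaskStem(nn.Module):
        # Four 3x3 convolutions applied after concatenating the coordinate channels,
        # mirroring mask_features = mask_convs(cat(coord_features, P)).
        def __init__(self, in_channels, width=256):
            super().__init__()
            layers, c = [], in_channels + 2
            for _ in range(4):
                layers += [nn.Conv2d(c, width, 3, padding=1), nn.ReLU(inplace=True)]
                c = width
            self.convs = nn.Sequential(*layers)

        def forward(self, p):
            return self.convs(torch.cat((coord_features(p), p), dim=1))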
The obtained mask_features feature map is then input into a SAM module (spatial attention module) and a mask kernel generation module (mask kernel), respectively.
As shown in fig. 5, the SAM module takes the input feature map mask_features ∈ R^(C×H×W), performs maximum pooling (Max Pool) and average pooling (Avg Pool) along the channel dimension to obtain pooled features P_max, P_avg ∈ R^(1×H×W), takes the dot product of P_max and P_avg, applies a 3×3 convolution (recovering the channel), then a sigmoid operation, and finally multiplies the result element-wise with the original input to obtain the spatial attention feature.
Specifically, the following formula is adopted to represent the corresponding operation of the SAM module:
A_sam(mask_features) = σ(conv_3×3(P_max ⊙ P_avg))

where A_sam(mask_features) denotes the attention map obtained after the sigmoid, σ denotes the sigmoid function, and ⊙ denotes the dot product;

spatial_attention_features_sam = A_sam(mask_features) ⊗ mask_features

where spatial_attention_features_sam denotes the spatial attention feature finally output by the SAM module, and ⊗ denotes element-wise multiplication.
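The SAM operation above can be written compactly in PyTorch as follows (a minimal sketch of the described module):

    import torch
    import torch.nn as nn

    class SpatialAttentionModule(nn.Module):
        # Channel-wise max and average pooling, dot product of the two pooled maps,
        # 3x3 convolution, sigmoid, and element-wise multiplication with the input.
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)

        def forward(self, mask_features):
            p_max, _ = mask_features.max(dim=1, keepdim=True)    # (N, 1, H, W)
            p_avg = mask_features.mean(dim=1, keepdim=True)       # (N, 1, H, W)
            attn = torch.sigmoid(self.conv(p_max * p_avg))        # A_sam in the formula above
            return mask_features * attn                           # spatial attention features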
In the mask kernel module of this embodiment, the dimension of the input feature map mask_features is adjusted through a linear layer to generate a 128-dimensional mask kernel, formulated as follows:
w = Linear_256->128(mask_features)
where w ∈ R^(1×C) denotes the mask kernel.
Finally, the spatial attention feature is multiplied by the mask kernel to obtain the segmentation mask output by the mask branch, expressed by the following formula:
pred_mask = w ⊗ spatial_attention_features_sam
where pred_mask is the segmentation mask output by the mask branch, w ∈ R^(1×C) denotes the mask kernel, and spatial_attention_features_sam ∈ R^(C×H×W) is the spatial attention feature.
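A sketch of the mask kernel generation and the final multiplication is given below for a single kernel. How mask_features is reduced to a 256-dimensional vector before the linear layer, the use of a sigmoid on the output, and the assumption that the spatial attention features carry 128 channels (so that the product with the 128-dimensional kernel is well defined) are illustrative choices, not details fixed by the text.

    import torch
    import torch.nn as nn

    kernel_gen = nn.Linear(256, 128)   # the Linear 256 -> 128 layer from the formula above

    def predict_mask(mask_features, spatial_attention_features):
        # mask_features: (256, H, W); spatial_attention_features: (128, H, W)
        vec = mask_features.mean(dim=(1, 2))                   # global average pooling (assumption)
        w = kernel_gen(vec)                                    # (128,) mask kernel
        c, h, wd = spatial_attention_features.shape
        logits = (w @ spatial_attention_features.view(c, h * wd)).view(h, wd)
        return torch.sigmoid(logits)                           # predicted segmentation mask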
Step S5: mapping the segmentation mask, the target boxes and the target classifications onto the image to be processed to obtain the instance segmentation result.
Through the above steps, the segmentation mask, the target boxes and the target classifications have been obtained. Finally, the segmentation mask only needs to be upsampled to the resolution of the original image to be processed, after which the segmentation mask, the target boxes and the target classifications can be mapped onto the image to be processed, realizing instance segmentation.
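A minimal sketch of this final mapping step (bilinear upsampling of the mask probabilities to the original resolution followed by thresholding; the threshold value is illustrative):

    import torch.nn.functional as F

    def map_to_image(pred_masks, image_size, threshold=0.5):
        # pred_masks: (N, H, W) mask probabilities; image_size: (H_img, W_img).
        up = F.interpolate(pred_masks.unsqueeze(1), size=image_size,
                           mode="bilinear", align_corners=False).squeeze(1)
        return up > threshold    # boolean instance masks at the original image resolution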
In one specific embodiment, since one-to-one assignment forces most predictions to be background, which reduces the confidence of the class, this embodiment introduces the intersection-over-union IoU to adjust the target classification probability when training the instance segmentation network. That is, the enhanced feature map is input into the instance branch to obtain target boxes and target classifications, where the target classification probability corresponding to target box i is P_i, and IoU is introduced to adjust the target classification probability according to the following formula:
$$\hat{P}_i = \sqrt{S_i \cdot P_i}$$

where $\hat{P}_i$ is the adjusted target classification probability, $S_i$ denotes the intersection-over-union IoU, and $P_i$ denotes the target classification probability corresponding to target box i. $S_i$ is calculated as:

$$S_i = \frac{\sum_{x,y} a_{x,y}\, b_{x,y}}{\sum_{x,y} \big(a_{x,y} + b_{x,y} - a_{x,y}\, b_{x,y}\big)}$$

where $a_{x,y}$ and $b_{x,y}$ denote the pixels of the predicted target box a and the real target box b at (x, y), respectively.
In a specific embodiment, when training the instance segmentation network, the method also computes the overall loss function of the network, performs back-propagation, and updates the network parameters.
The overall network loss is composed of the binary cross-entropy loss for target classification $\mathcal{L}_{cls}$, the mask loss $\mathcal{L}_{mask}$ and the target box loss $\mathcal{L}_{box}$, and can be expressed as:

$$\mathcal{L} = \lambda_{cls}\,\mathcal{L}_{cls} + \mathcal{L}_{mask} + \mathcal{L}_{box}$$

where $\mathcal{L}_{cls}$ denotes the target classification loss, $\mathcal{L}_{mask}$ denotes the mask loss, $\mathcal{L}_{box}$ denotes the target box loss, and $\lambda_{cls}$ is a weight coefficient.
To enable end-to-end training, label assignment is formulated as bipartite graph matching. First, a pairwise dice-based matching score between the i-th prediction and the k-th ground-truth object is defined; it is determined by the classification score and the dice coefficient of the segmentation mask:

$$C(i,k) = p_{i,c_k}^{\,1-\alpha} \cdot \mathrm{DICE}(m_i, t_k)^{\,\alpha}$$

where $\alpha$ is a hyper-parameter set to 0.8 that balances the influence of classification and segmentation, $p_{i,c_k}$ is the probability that the class predicted by the i-th instance is the class of the k-th ground-truth object, and $m_i$ and $t_k$ are the mask of the i-th predicted object and the mask of the k-th ground-truth object, respectively. The DICE coefficient is defined as:

$$\mathrm{DICE}(m_i, t_k) = \frac{2\sum_{x,y} m^{i}_{x,y}\, t^{k}_{x,y}}{\sum_{x,y} \big(m^{i}_{x,y}\big)^2 + \sum_{x,y} \big(t^{k}_{x,y}\big)^2}$$

where $m^{i}_{x,y}$ and $t^{k}_{x,y}$ denote the pixels at (x, y) of the i-th predicted mask and the k-th ground-truth object mask, respectively. The Hungarian algorithm is then used to find the best match between the K ground-truth objects and the N predictions.
The mask loss function is obtained by combining the dice loss and the pixel-level binary cross-entropy loss:

$$\mathcal{L}_{mask} = \lambda_{dice}\,\mathcal{L}_{dice} + \lambda_{pix}\,\mathcal{L}_{pix}$$

where $\mathcal{L}_{dice}$ and $\mathcal{L}_{pix}$ are the dice loss and the pixel-level binary cross-entropy loss, $\lambda_{dice}$ and $\lambda_{pix}$ are the corresponding weight coefficients, and the dice loss is computed from the DICE coefficient defined above.
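A corresponding sketch of the mask loss is given below; the λ values are placeholders, and the use of 1 minus the DICE coefficient for the dice loss is an assumed (but standard) form:

    import torch
    import torch.nn.functional as F

    def dice_loss(pred, target, eps=1e-6):
        # Assumed form: 1 - DICE, with the DICE coefficient defined above.
        num = 2.0 * (pred * target).sum(dim=(-2, -1))
        den = (pred * pred).sum(dim=(-2, -1)) + (target * target).sum(dim=(-2, -1)) + eps
        return 1.0 - num / den

    def mask_loss(pred_logits, target, lambda_dice=2.0, lambda_pix=2.0):
        # L_mask = lambda_dice * L_dice + lambda_pix * L_pix
        pred = torch.sigmoid(pred_logits)
        l_dice = dice_loss(pred, target).mean()
        l_pix = F.binary_cross_entropy_with_logits(pred_logits, target)
        return lambda_dice * l_dice + lambda_pix * l_pix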
the above-mentioned embodiments only express several embodiments of the present application, and the description thereof is specific and detailed, but not to be understood as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (7)

1. A real-time instance segmentation method fusing a sparse framework and spatial attention, characterized in that instance segmentation is performed on an image with a constructed instance segmentation network comprising a feature extraction network, a feature enhancement network, a mask branch and an instance branch, the method comprising the following steps:
acquiring an image to be processed and inputting it into the feature extraction network to extract a multi-scale feature map;
inputting the multi-scale feature map into the feature enhancement network to obtain an enhanced feature map;
inputting the enhanced feature map into the instance branch to obtain target boxes and target classifications;
concatenating the enhanced feature map with the target boxes output by the instance branch and inputting the result into the mask branch, first performing a convolution operation, then passing the result through a spatial attention module and a mask kernel generation module to obtain a spatial attention feature and a mask kernel respectively, and multiplying the two to obtain a segmentation mask;
and mapping the segmentation mask, the target boxes and the target classifications onto the image to be processed to obtain the instance segmentation result.
2. The real-time instance segmentation method fusing a sparse framework and spatial attention according to claim 1, characterized in that the feature extraction network adopts ResNet50, using the outputs of residual modules 3, 4 and 5 in ResNet50 as the extracted third-scale feature map, second-scale feature map and first-scale feature map;
inputting the multi-scale feature map into the feature enhancement network to obtain an enhanced feature map comprises:
inputting the first-scale feature map into a pyramid pooling module and outputting a first feature map;
adding the second-scale feature map and the first feature map element-wise to obtain a second feature map;
adding the third-scale feature map and the second feature map element-wise to obtain a third feature map;
and performing a convolution operation on the first, second and third feature maps respectively and then concatenating them to obtain the enhanced feature map.
3. The real-time instance segmentation method fusing a sparse framework and spatial attention according to claim 1, characterized in that inputting the enhanced feature map into the instance branch to obtain target boxes and target classifications comprises:
initializing proposal boxes and proposal features;
extracting features of interest from the enhanced feature map by region feature focusing;
and inputting the features of interest and the proposal features into a dynamic convolution head to generate the final target boxes and target classifications.
4. The real-time instance segmentation method fusing a sparse framework and spatial attention according to claim 1, characterized in that the spatial attention module performs the following operations:
performing maximum pooling and average pooling on the input feature map along the channel dimension to obtain pooled features P_max and P_avg, taking the dot product of P_max and P_avg, applying a 3×3 convolution followed by a sigmoid operation, and multiplying the result with the original input feature map to obtain the spatial attention feature.
5. The real-time instance segmentation method fusing a sparse framework and spatial attention according to claim 1, characterized in that the mask kernel generation module performs the following operations:
dimension is adjusted through linear layers on the input feature graph, and a 128-dimensional mask kernel is generated.
6. The real-time instance segmentation method fusing a sparse framework and spatial attention according to claim 1, characterized in that, when training the instance segmentation network, inputting the enhanced feature map into the instance branch to obtain target boxes and target classifications further comprises:
introducing the intersection-over-union IoU to adjust the target classification probability according to the following formula:
$$\hat{P}_i = \sqrt{S_i \cdot P_i}$$

where $\hat{P}_i$ is the adjusted target classification probability, $S_i$ denotes the intersection-over-union IoU, and $P_i$ denotes the target classification probability corresponding to target box i.
7. The real-time instance segmentation method fusing a sparse framework and spatial attention according to claim 1, characterized in that, when training the instance segmentation network, the overall loss function of the network is as follows:

$$\mathcal{L} = \lambda_{cls}\,\mathcal{L}_{cls} + \mathcal{L}_{mask} + \mathcal{L}_{box}$$

where $\mathcal{L}_{cls}$ denotes the target classification loss, $\mathcal{L}_{mask}$ denotes the mask loss, $\mathcal{L}_{box}$ denotes the target box loss, and $\lambda_{cls}$ is a weight coefficient;

$$\mathcal{L}_{mask} = \lambda_{dice}\,\mathcal{L}_{dice} + \lambda_{pix}\,\mathcal{L}_{pix}$$

where $\mathcal{L}_{dice}$ and $\mathcal{L}_{pix}$ are the dice loss and the pixel-level binary cross-entropy loss, and $\lambda_{dice}$ and $\lambda_{pix}$ are the corresponding weight coefficients.
CN202210803057.XA 2022-07-07 2022-07-07 Real-time instance segmentation method integrating sparse framework and spatial attention Pending CN115100410A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210803057.XA CN115100410A (en) 2022-07-07 2022-07-07 Real-time instance segmentation method integrating sparse framework and spatial attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210803057.XA CN115100410A (en) 2022-07-07 2022-07-07 Real-time instance segmentation method integrating sparse framework and spatial attention

Publications (1)

Publication Number Publication Date
CN115100410A true CN115100410A (en) 2022-09-23

Family

ID=83296081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210803057.XA Pending CN115100410A (en) 2022-07-07 2022-07-07 Real-time instance segmentation method integrating sparse framework and spatial attention

Country Status (1)

Country Link
CN (1) CN115100410A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117593530A (en) * 2024-01-19 2024-02-23 杭州灵西机器人智能科技有限公司 Dense carton segmentation method and system


Similar Documents

Publication Publication Date Title
US10943145B2 (en) Image processing methods and apparatus, and electronic devices
CN110097568B (en) Video object detection and segmentation method based on space-time dual-branch network
CN110516536B (en) Weak supervision video behavior detection method based on time sequence class activation graph complementation
CN113158723B (en) End-to-end video motion detection positioning system
Deng et al. MLOD: A multi-view 3D object detection based on robust feature fusion method
CN111079739A (en) Multi-scale attention feature detection method
CN111898432A (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN111401293A (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN114187311A (en) Image semantic segmentation method, device, equipment and storage medium
Ma et al. Fusioncount: Efficient crowd counting via multiscale feature fusion
Amirian et al. Dissection of deep learning with applications in image recognition
CN115424017B (en) Building inner and outer contour segmentation method, device and storage medium
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
US20230154139A1 (en) Systems and methods for contrastive pretraining with video tracking supervision
CN116129291A (en) Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device
Harkat et al. Fire detection using residual deeplabv3+ model
Yuan et al. A lightweight network for smoke semantic segmentation
CN115100410A (en) Real-time instance segmentation method integrating sparse framework and spatial attention
CN114996495A (en) Single-sample image segmentation method and device based on multiple prototypes and iterative enhancement
Aldhaheri et al. MACC Net: Multi-task attention crowd counting network
Yuan et al. A cross-scale mixed attention network for smoke segmentation
Quiroga et al. A study of convolutional architectures for handshape recognition applied to sign language
Cao et al. QuasiVSD: efficient dual-frame smoke detection
Kurama et al. Image semantic segmentation using deep learning
CN117315752A (en) Training method, device, equipment and medium for face emotion recognition network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination