CN117373020A - Instance segmentation method and system based on geometric constraint dynamic convolution - Google Patents

Instance segmentation method and system based on geometric constraint dynamic convolution

Info

Publication number
CN117373020A
CN117373020A (Application No. CN202311345265.0A)
Authority
CN
China
Prior art keywords
feature
instance
dynamic
features
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311345265.0A
Other languages
Chinese (zh)
Inventor
丛润民
陈锦芃
张伟
孙豪言
宋然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202311345265.0A
Publication of CN117373020A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Abstract

The invention discloses an instance segmentation method and system based on geometric constraint dynamic convolution, which extract multi-level features of an image to be segmented to obtain multi-level features; perform instance perception on the multi-level features to obtain the centers, corresponding category confidences and bounding boxes of all predicted instances; perform suppression based on the bounding boxes to obtain the reserved instance centers; extract bottom features from the multi-level features to obtain the bottom features; perform center feature extraction based on the multi-level features and each reserved instance center, and generate a dynamic convolution kernel for peripheral point location; perform a dynamic convolution operation on the bottom features using the dynamic convolution kernel for peripheral point location, and perform peripheral point map prediction to generate a peripheral point map; perform feature extraction and differentiated feature fusion based on the multi-level features, the reserved instance centers and the peripheral point map, and generate a dynamic convolution kernel for segmentation; and perform a dynamic convolution operation on the bottom features using the dynamic convolution kernel for segmentation to obtain a segmentation mask.

Description

Instance segmentation method and system based on geometric constraint dynamic convolution
Technical Field
The invention relates to the technical field of instance segmentation, in particular to an instance segmentation method and system based on geometric constraint dynamic convolution.
Background
The statements in this section merely relate to the background of the present disclosure and may not necessarily constitute prior art.
Instance segmentation (Instance Segmentation) has become an important research direction in the field of computer vision; it aims to identify and distinguish individual object instances in an image and to precisely delineate the contour of each instance. This task bridges object detection (Object Detection) and semantic segmentation (Semantic Segmentation), because it not only needs to identify the object classes in the image, but also needs to distinguish between different instances of the same class. Compared with traditional object detection and semantic segmentation methods, instance segmentation can provide a more detailed and richer scene analysis, and can be widely applied in fields such as robotics, augmented reality and medical image analysis.
Since the advent of the deep learning era, the most advanced instance segmentation methods have been implemented using deep neural networks, among which convolutional neural networks (Convolutional Neural Network, CNN) are the most dominant option. The most classical instance segmentation method is Mask R-CNN, proposed by He et al. in 2017, and many subsequent methods are based on the main architecture of Mask R-CNN. Such models typically first locate salient instances by bounding boxes and then crop regions of interest (Region of Interest, RoI) from the complete feature map. This decomposes the overall task into two subtasks, detection and segmentation; the division is intuitive, and good performance can quickly be achieved by extending existing object detectors. However, it also has some drawbacks:
(1) RoIs are generally rectangular and axis-aligned. When an instance has an irregular shape or a diagonal orientation, the RoI may contain too much background and too little of the instance itself, complicating segmentation.
(2) With RoIAlign, RoIs of different spatial sizes are resized to a uniform size (e.g., 14×14). Such a strategy may compromise the quality of the segmentation, especially for large instances with complex boundaries.
(3) The result of the segmentation depends to a large extent on the preceding detection stage. Even if the segmentation head is excellent, it may perform poorly if the RoI is incomplete, because it can only capture features within the RoI.
In carrying out the invention, the inventors have found that at least the following drawbacks and deficiencies in the prior art are present:
the example segmentation method based on Mask R-CNN main body architecture has the following three defects:
(1) RoIs are generally rectangular and aligned with the coordinate axes, so instances with irregular or diagonal shapes occupy only a small portion of their RoIs, making segmentation more difficult.
(2) RoIs of different spatial sizes are resized to a uniform size and therefore cannot accommodate instances of different sizes.
(3) The outcome of the segmentation is largely dependent on the preceding detection stage, and a lower detection quality can greatly limit the segmentation quality.
The prior art also includes methods that use dynamic convolution to avoid the problems related to RoIs, but they have the following disadvantages:
(1) The generation of the dynamic convolution kernel depends only on the features of a single point, so the captured features are one-sided, which affects comprehensive and accurate segmentation;
(2) The dynamic convolution kernel contains only a feature representation without any geometric constraint information, making it difficult to support accurate segmentation.
Disclosure of Invention
In order to solve the defects in the prior art, the invention provides an example segmentation method and system based on geometric constraint dynamic convolution;
in one aspect, an example segmentation method based on geometric constraint dynamic convolution is provided;
an instance segmentation method based on geometric constraint dynamic convolution, comprising the following steps:
acquiring an image to be segmented;
inputting the image to be segmented into a trained geometric constraint dynamic convolution network to obtain a segmentation mask and a category confidence corresponding to an image instance in the image to be segmented;
determining an image instance segmentation result corresponding to the image to be segmented based on a segmentation mask corresponding to the image instance and a category confidence coefficient in the image to be segmented;
wherein the trained geometric constraint dynamic convolution network is used for: carrying out multi-level feature extraction on the image to be segmented to obtain multi-level features; performing instance perception on the multi-level features to obtain the centers, corresponding category confidences and bounding boxes of all predicted instances; performing non-maximum suppression (Non-Maximum Suppression, NMS) based on the bounding boxes to obtain the reserved instance centers; extracting bottom features from the multi-level features to obtain the bottom features; performing center feature extraction based on the multi-level features and the reserved instance centers to generate a dynamic convolution kernel for peripheral point location; performing a dynamic convolution operation on the bottom features using the dynamic convolution kernel for peripheral point location, performing peripheral point map prediction, and generating a peripheral point map; performing feature extraction and differentiated feature fusion based on the multi-level features, the reserved instance centers and the peripheral point map, and generating a dynamic convolution kernel for segmentation; and performing a dynamic convolution operation on the bottom features using the dynamic convolution kernel for segmentation to obtain a segmentation mask.
In another aspect, an example segmentation system based on geometric constraint dynamic convolution is provided;
an example segmentation system based on geometric constraint dynamic convolution, comprising:
an acquisition module configured to: acquiring an image to be segmented;
a prediction module configured to: inputting the image to be segmented into a trained geometric constraint dynamic convolution network to obtain a segmentation mask and a category confidence corresponding to an image instance in the image to be segmented;
a result output module configured to: determining an image instance segmentation result corresponding to the image to be segmented based on a segmentation mask corresponding to the image instance and a category confidence coefficient in the image to be segmented;
wherein the trained geometric constraint dynamic convolution network is used for: carrying out multi-level feature extraction on the image to be segmented to obtain multi-level features; performing instance perception on the multi-level features to obtain the centers, corresponding category confidences and bounding boxes of all predicted instances; performing non-maximum suppression (Non-Maximum Suppression, NMS) based on the bounding boxes to obtain the reserved instance centers; extracting bottom features from the multi-level features to obtain the bottom features; performing feature extraction based on the multi-level features and the reserved instance centers to generate a dynamic convolution kernel for peripheral point location; performing a dynamic convolution operation on the bottom features using the dynamic convolution kernel for peripheral point location, performing peripheral point map prediction, and generating a peripheral point map; performing feature extraction and differentiated feature fusion based on the multi-level features, the reserved instance centers and the peripheral point map, and generating a dynamic convolution kernel for segmentation; and performing a dynamic convolution operation on the bottom features using the dynamic convolution kernel for segmentation to obtain a segmentation mask.
In still another aspect, there is provided an electronic device including:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer-readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of the first aspect described above.
In yet another aspect, there is also provided a storage medium non-transitory storing computer readable instructions, wherein the instructions of the method of the first aspect are executed when the non-transitory computer readable instructions are executed by a computer.
In a further aspect, there is also provided a computer program product comprising a computer program which, when run on one or more processors, implements the method of the first aspect described above.
The above technical solution has the following advantages or beneficial effects:
(1) The invention proposes a dynamic convolution network with geometric constraints (GCDCNet) to solve the instance segmentation task. It does not use regions of interest (Regions of Interest, RoI); instead, it employs a keypoints-guided dynamic convolution (Keypoints-guided Dynamic Convolution, KGDC) mechanism to generate masks directly from the complete features. Experiments demonstrate that GCDCNet outperforms existing methods.
(2) Considering geometric spatial constraints, the invention introduces peripheral points in addition to the center to jointly generate the dynamic convolution kernel. In this way, not only can comprehensive and diversified feature patterns of an instance be captured, but the network is also encouraged to perform more targeted feature learning under geometric constraints.
(3) The invention designs a differential feature fusion (Differentiated Patterns Fusion, DPF) module which is used for adaptively measuring the influence of features at different key points and ensuring the comprehensiveness of a final convolution kernel.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
Fig. 1 is a diagram of a network overall architecture according to a first embodiment;
fig. 2 is a diagram showing an internal structure of a backbone network according to the first embodiment;
FIG. 3 is a schematic diagram of a backbone network and a feature pyramid network according to the first embodiment;
FIG. 4 is a schematic diagram showing the internal structure of a classification head according to the first embodiment;
FIG. 5 is a schematic diagram showing the internal structure of a bounding box prediction head according to the first embodiment;
FIG. 6 is a schematic diagram of the internal structure of a dynamic parameter generating head for peripheral point positioning according to the first embodiment;
Fig. 7 is a schematic diagram of the internal structure of a dynamic parameter generating head for segmentation according to the first embodiment;
FIG. 8 is a schematic view showing the internal structure of a bottom module according to the first embodiment;
FIG. 9 is a diagram illustrating a dynamic convolution operation according to the first embodiment;
FIG. 10 is a schematic diagram of the internal structure of a dynamic convolution module using keypoint guidance according to the first embodiment;
FIG. 11 is an internal architecture diagram of a DPF module of the first embodiment;
fig. 12 (a) is an input image of the first embodiment;
fig. 12 (b) is a true value of the first embodiment;
fig. 12 (c) shows the prediction result of S4 Net;
FIG. 12 (d) shows the result of SCG prediction;
FIG. 12 (e) is a predicted result of RDPNet;
fig. 12 (f) shows the prediction result of GCDCNet.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
Term interpretation: instance segmentation (Instance Segmentation) is a task in computer vision that aims to identify individual object instances in an image and assign a unique label to each instance while precisely demarcating the outline of each instance. Unlike semantic segmentation (Semantic Segmentation), instance segmentation not only identifies the class of objects, but also distinguishes different instances of the same class.
The currently prevailing instance segmentation (Instance Segmentation) methods based on convolutional neural networks (Convolutional Neural Network, CNN) are typically built on top of the main architecture of Mask R-CNN, i.e. instances are first located by bounding boxes and then segmented within each bounding box. However, this approach has difficulty handling instances of different sizes well, and the segmentation quality is highly dependent on the detection quality of the bounding boxes. Thus, some approaches employ dynamic convolution to achieve segmentation without bounding boxes. However, these methods typically rely solely on the features of a single point within an instance to generate the dynamic convolution kernel, so the captured features are one-sided, making it difficult to accurately segment an instance containing multiple different features. Therefore, improving the comprehensiveness of the dynamic convolution kernel to achieve a more complete segmentation is the key issue addressed by the present invention.
Example 1
The embodiment provides an example segmentation method based on geometric constraint dynamic convolution;
an instance segmentation method based on geometric constraint dynamic convolution, comprising the following steps:
S101: acquiring an image to be segmented;
S102: inputting the image to be segmented into a trained geometric constraint dynamic convolution network to obtain a segmentation mask and a category confidence corresponding to an image instance in the image to be segmented;
S103: determining an image instance segmentation result corresponding to the image to be segmented based on the segmentation mask and the category confidence corresponding to the image instance in the image to be segmented;
wherein the trained geometric constraint dynamic convolution network is used for:
S102-1: carrying out multi-level feature extraction on the image to be segmented to obtain multi-level features;
S102-2: performing instance perception on the multi-level features to obtain the centers, corresponding category confidences and bounding boxes of all predicted instances;
S102-3: performing non-maximum suppression (Non-Maximum Suppression, NMS) based on the bounding boxes to obtain the reserved instance centers;
S102-4: extracting bottom features from the multi-level features to obtain the bottom features;
S102-5: performing center feature extraction based on the multi-level features and each reserved instance center, and generating a dynamic convolution kernel for peripheral point location;
S102-6: performing a dynamic convolution operation on the bottom features using the dynamic convolution kernel for peripheral point location, performing peripheral point map prediction, and generating a peripheral point map;
S102-7: performing feature extraction and differentiated feature fusion based on the multi-level features, the reserved instance centers and the peripheral point map, and generating a dynamic convolution kernel for segmentation;
S102-8: performing a dynamic convolution operation on the bottom features using the dynamic convolution kernel for segmentation to obtain a segmentation mask.
Further, the training process of the trained geometric constraint dynamic convolution network comprises the following steps:
constructing a training set, wherein the training set consists of images with known image instance labels, and the image instance labels include: an instance segmentation mask label and an instance category label;
inputting the training set into the geometric constraint dynamic convolution network, training the network, and stopping training when the total loss function value of the network is not reduced any more, so as to obtain the trained geometric constraint dynamic convolution network.
Further, the total loss function L of the network comprises four terms, namely the classification loss L_cls, the bounding box regression loss L_reg, the peripheral point location loss L_p and the segmentation loss L_mask. The formula is as follows:

L = L_cls + L_reg + L_p + λ·L_mask

where L_p and L_mask are Dice losses, and λ is set to 5.
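As an illustrative, non-authoritative sketch of how such a total loss could be assembled in PyTorch-style Python, the dice_loss helper below and the tensor shapes are assumptions; the patent itself only fixes the four terms and λ = 5:

import torch

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice loss over flattened prediction/target maps with values in [0, 1].
    pred, target = pred.flatten(1), target.flatten(1)
    inter = (pred * target).sum(dim=1)
    union = pred.sum(dim=1) + target.sum(dim=1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def total_loss(l_cls, l_reg, pred_points, gt_heatmaps, pred_masks, gt_masks, lam=5.0):
    # L = L_cls + L_reg + L_p + lambda * L_mask, with L_p and L_mask realised as Dice losses.
    l_p = dice_loss(pred_points, gt_heatmaps)
    l_mask = dice_loss(pred_masks, gt_masks)
    return l_cls + l_reg + l_p + lam * l_mask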
Further, S102-1: extracting the multi-level features of the image to be segmented to obtain multi-level features, which specifically comprise:
as shown in Fig. 2, a backbone network is adopted to perform multi-level feature extraction on the image to be segmented, obtaining feature F_1, feature F_2, feature F_3, feature F_4 and feature F_5;
as shown in Fig. 3, feature maps F_3, F_4 and F_5 are input into a feature pyramid network for further feature extraction, obtaining feature E_3, feature E_4, feature E_5, feature E_6 and feature E_7.
Illustratively, the backbone network can be implemented using a ResNet50 network.
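For illustration only, the multi-level feature extraction of S102-1 could be sketched as follows with a ResNet50 backbone and a torchvision feature pyramid network; the 256-channel width and the derivation of E_6 and E_7 from two extra stride-2 convolutions are assumptions rather than prescriptions of the embodiment:

import torch
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.ops import FeaturePyramidNetwork

class MultiLevelExtractor(nn.Module):
    # Sketch: ResNet50 stages give F_1..F_5; an FPN over F_3..F_5 gives E_3..E_5; E_6/E_7 are assumed extra convolutions.
    def __init__(self, ch=256):
        super().__init__()
        net = resnet50()
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)  # F_1 level
        self.layer1, self.layer2 = net.layer1, net.layer2                     # F_2, F_3
        self.layer3, self.layer4 = net.layer3, net.layer4                     # F_4, F_5
        self.fpn = FeaturePyramidNetwork([512, 1024, 2048], ch)
        self.e6 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)
        self.e7 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)

    def forward(self, x):
        f2 = self.layer1(self.stem(x))
        f3 = self.layer2(f2)
        f4 = self.layer3(f3)
        f5 = self.layer4(f4)
        feats = self.fpn({"f3": f3, "f4": f4, "f5": f5})
        e3, e4, e5 = feats["f3"], feats["f4"], feats["f5"]
        e6 = self.e6(e5)
        e7 = self.e7(torch.relu(e6))
        return e3, e4, e5, e6, e7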
Further, S102-2: performing instance perception on the multi-level features to obtain the centers, corresponding category confidences and bounding boxes of all predicted instances, which specifically comprises the following steps:
Feature E_3 is input into the classification head network in the corresponding first instance perception head network, and the classification head network in the first instance perception head network outputs all predicted instance centers and their classification confidences; at the same time, feature E_3 is input into the bounding box prediction head network in the corresponding first instance perception head network, which predicts the bounding boxes corresponding to all predicted instance centers.
Feature E_4 is input into the classification head network in the corresponding second instance perception head network, and the classification head network in the second instance perception head network outputs all predicted instance centers and their classification confidences; at the same time, feature E_4 is input into the bounding box prediction head network in the corresponding second instance perception head network, which predicts the bounding boxes corresponding to all predicted instance centers.
Feature E_5 is input into the classification head network in the corresponding third instance perception head network, and the classification head network in the third instance perception head network outputs all predicted instance centers and their classification confidences; at the same time, feature E_5 is input into the bounding box prediction head network in the corresponding third instance perception head network, which predicts the bounding boxes corresponding to all predicted instance centers.
Feature E_6 is input into the classification head network in the corresponding fourth instance perception head network, and the classification head network in the fourth instance perception head network outputs all predicted instance centers and their classification confidences; at the same time, feature E_6 is input into the bounding box prediction head network in the corresponding fourth instance perception head network, which predicts the bounding boxes corresponding to all predicted instance centers.
Feature E_7 is input into the classification head network in the corresponding fifth instance perception head network, and the classification head network in the fifth instance perception head network outputs all predicted instance centers and their classification confidences; at the same time, feature E_7 is input into the bounding box prediction head network in the corresponding fifth instance perception head network, which predicts the bounding boxes corresponding to all predicted instance centers.
Further, the internal structures of the first instance awareness head network, the second instance awareness head network, the third instance awareness head network, the fourth instance awareness head network and the fifth instance awareness head network are the same, and the first instance awareness head network internally comprises:
four parallel branches: the first branch is a classification head, the second branch is a bounding box prediction head, the third branch is a dynamic parameter generation head for peripheral point positioning, and the fourth branch is a dynamic parameter generation head for segmentation.
Further, as shown in fig. 4, the classification head has a network structure as follows:
four 3 x 3 convolutional layers and one 1 x 1 convolutional layer connected in sequence.
Further, the working process of the classification head is as follows: taking the corresponding hierarchical features as input, a center point map is generated through four 3×3 convolutional layers and one 1×1 convolutional layer; in the center point map, the values at all predicted instance center positions are 1 and the values at other positions are 0.
Further, as shown in Fig. 5, the bounding box prediction head has the following network structure:
four 3 x 3 convolutional layers and one 1 x 1 convolutional layer connected in sequence.
Further, the working process of the bounding box prediction head is as follows: taking the corresponding hierarchical features as input, the bounding box corresponding to each position is generated through four 3×3 convolutional layers and one 1×1 convolutional layer; the bounding box is expressed as the distances of its four sides from its center point.
Further, as shown in fig. 6, the dynamic parameter generating header for peripheral point positioning has a network structure as follows: four 3 x 3 convolutional layers and one 1 x 1 convolutional layer connected in sequence.
Further, the working process of the dynamic parameter generating head for peripheral point location is as follows: taking the corresponding hierarchical features as input, the dynamic parameters for peripheral point location are generated through four 3×3 convolutional layers and one 1×1 convolutional layer; these dynamic parameters will be used to construct the dynamic convolution kernel for peripheral point location.
Further, as shown in fig. 7, the dynamic parameter generating header for segmentation has a network structure as follows: four 3 x 3 convolutional layers and one 1 x 1 convolutional layer connected in sequence.
Further, the working process of the dynamic parameter generating head for segmentation is as follows: taking the corresponding hierarchical features as input, the dynamic parameters for segmentation are generated through four 3×3 convolutional layers and one 1×1 convolutional layer; these dynamic parameters will be used to construct the dynamic convolution kernel for segmentation.
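Each of the four branches described above shares the same basic topology (four 3×3 convolutional layers followed by one 1×1 convolutional layer). A minimal PyTorch-style sketch of such a head is given below; the channel widths and the ReLU activations are assumptions, since the embodiment does not specify them:

import torch.nn as nn

def make_head(in_ch, mid_ch, out_ch):
    # Hypothetical head: four 3x3 convolutional layers followed by one 1x1 convolutional layer.
    layers, ch = [], in_ch
    for _ in range(4):
        layers += [nn.Conv2d(ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        ch = mid_ch
    layers.append(nn.Conv2d(mid_ch, out_ch, kernel_size=1))
    return nn.Sequential(*layers)

# Illustrative instantiations: a classification head over an assumed number of classes,
# and a bounding box prediction head outputting the four side-to-center distances.
cls_head = make_head(256, 256, 80)
box_head = make_head(256, 256, 4)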
It should be appreciated that the instance perception head attached to each FPN layer includes: a classification head, a bounding box prediction head, and two dynamic parameter generating heads. The classification head and the bounding box prediction head follow the design of the fully convolutional one-stage object detector (Fully Convolutional One-Stage Object Detector, FCOS) and are used for predicting the instance centers and the corresponding bounding boxes, respectively.
In the method of the present invention, the bounding box is used only for Non-maximum suppression (Non-Maximum Suppression, NMS), and no RoI is generated to limit the segmented input regions, unlike the RoI-based model.
The two dynamic parameter generating heads respectively generate the parameters used to construct the dynamic convolution kernels for locating peripheral points and for segmenting instances. The output of the dynamic parameter generating head for peripheral point location is expressed as DP ∈ R^(H_k×W_k×C_DP), and the output of the dynamic parameter generating head for segmentation is denoted DS ∈ R^(H_k×W_k×C_DS), where H_k and W_k represent the spatial dimensions of the corresponding FPN layer, and C_DP and C_DS indicate the numbers of channels.
Further, the step S102-3: non-maximum suppression (Non-Maximum Suppression, NMS) is performed based on bounding boxes, resulting in a reserved instance center, specifically comprising:
(1) Sorting all the bounding boxes according to the classification confidence;
(2) Starting from the bounding box with the highest confidence, judging whether the overlap ratio between each other bounding box and the bounding box with the highest confidence is larger than a set threshold (0.7 by default in the invention), and removing the bounding boxes whose overlap ratio is larger than the threshold;
(3) Repeating step (2) starting from the non-removed bounding box with the next highest confidence, until all non-removed bounding boxes have been processed;
(4) After non-maximum suppression, only the instance centers where the bounding box has not been removed remain.
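A simplified, non-authoritative sketch of steps (1) to (4) above is shown below; it assumes axis-aligned boxes given as (x1, y1, x2, y2) tuples and uses the default overlap threshold of 0.7:

def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms_keep_centers(boxes, scores, centers, thr=0.7):
    # Returns the instance centers whose bounding boxes survive non-maximum suppression.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)
        kept.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thr]
    return [centers[i] for i in kept]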
Further, as shown in Fig. 8, the step S102-4 of extracting the bottom features from the multi-level features to obtain the bottom features specifically uses a bottom module to extract the bottom features; the network structure of the bottom module is:
a 3×3 convolutional layer C_1, an upsampling layer S_1, an adder J_1, a 3×3 convolutional layer C_2, an upsampling layer S_2, an adder J_2, a 3×3 convolutional layer C_3, a 3×3 convolutional layer C_4, a 3×3 convolutional layer C_5, a 3×3 convolutional layer C_6 and a 1×1 convolutional layer C_7, connected in sequence;
the 3×3 convolutional layer C_1 is used for inputting feature E_5, the adder J_1 is used for inputting feature E_4, and the adder J_2 is used for inputting feature E_3; the 1×1 convolutional layer C_7 is used for outputting the bottom feature.
The working process of the bottom module is as follows:
Feature E_5 is passed through a 3×3 convolutional layer, upsampled to the size of E_4 and added to E_4; the result is passed through another 3×3 convolutional layer, upsampled to the size of E_3 and added to E_3; the summed result is then passed through four 3×3 convolutional layers and one 1×1 convolutional layer to obtain the bottom feature B.
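A non-authoritative sketch of this bottom module follows; the internal channel width of 256 and the number of output channels are illustrative assumptions:

import torch.nn as nn
import torch.nn.functional as F

class BottomModule(nn.Module):
    # Sketch: E_5 is merged top-down into E_4 and then E_3, and the sum is refined into the bottom feature B.
    def __init__(self, ch=256, out_ch=8):
        super().__init__()
        self.c1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.c2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.refine = nn.Sequential(*[nn.Conv2d(ch, ch, 3, padding=1) for _ in range(4)])  # C_3..C_6
        self.c7 = nn.Conv2d(ch, out_ch, 1)

    def forward(self, e3, e4, e5):
        x = F.interpolate(self.c1(e5), size=e4.shape[-2:], mode="bilinear", align_corners=False) + e4
        x = F.interpolate(self.c2(x), size=e3.shape[-2:], mode="bilinear", align_corners=False) + e3
        return self.c7(self.refine(x))  # bottom feature B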
Further, the step S102-5: performing center feature extraction based on the multi-level features and each reserved instance center, and generating a dynamic convolution kernel for peripheral point location; the specific flow is as follows:
Picking features on the center points from the output DP of the dynamic parameter generation head for peripheral point positioning to generate a first set of dynamic convolution kernels (for peripheral point positioning), and then convolving the bottom features B using the first set of dynamic convolution kernels to predict a peripheral point map;
For each of the N instance centers {c_i}, i ∈ {1, 2, …, N}, predicted by the classification head, the values on all channels of the output DP of the dynamic parameter generating head for peripheral point location at the same position are taken out, expressed as dp_i ∈ R^(C_DP), and these features are shaped by a reshape operation to form a set of dynamic convolution kernels G_i^p. These dynamic convolution kernels will be used to predict the peripheral point map.
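The following sketch illustrates picking the per-center parameter vector dp_i from DP and reshaping it into the two 1×1 dynamic convolution kernels assumed for peripheral point location; how the parameter vector is split between the two kernels is an assumption made only for illustration:

def build_peripheral_kernels(dp, centers, in_ch, mid_ch, out_ch=4):
    # dp: tensor of shape (C_DP, H_k, W_k); centers: list of (y, x) instance center positions.
    kernels = []
    for (y, x) in centers:
        params = dp[:, y, x]                                   # values on all channels at the center
        n1 = in_ch * mid_ch                                    # parameters of the first 1x1 dynamic conv
        w1 = params[:n1].reshape(mid_ch, in_ch, 1, 1)
        w2 = params[n1:n1 + mid_ch * out_ch].reshape(out_ch, mid_ch, 1, 1)
        kernels.append((w1, w2))
    return kernels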
Further, the step S102-6: performing a dynamic convolution operation on the bottom features using the dynamic convolution kernel for peripheral point location, and performing peripheral point map prediction to generate a peripheral point map, specifically comprising:
The peripheral point map P_i of prediction instance i is generated as follows; the whole process is formulated as:

P_i = Conv^(d2)_(3×3)( Conv^(d1)_(3×3)( DConv^(2)_(G_i^p)( DConv^(1)_(G_i^p)( Cat(B, RC) ) ) ) )

where B represents the bottom feature; RC is the relative graph with respect to the center position, which further indicates the target instance; Conv^(d1)_(3×3) and Conv^(d2)_(3×3) are 3×3 convolutional layers with a dilation coefficient of 2; DConv^(1)_(G_i^p) and DConv^(2)_(G_i^p) are 1×1 dynamic convolutions using the dynamic convolution kernel group G_i^p; and Cat denotes channel-wise concatenation. Each channel of P_i is a prediction map corresponding to one of the peripheral points, so P_i has four channels.
The relative graph has two channels, a first channel being the relative abscissa to the central position and a second channel being the relative ordinate to the central position.
For example, the center position is (4, 5), the two channel values at the (4, 5) position of the relative graph are (0, 0), the two channel values at the (3, 5) position are (-1, 0), the two channel values at the (4, 6) position are (0, 1), and so on.
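A minimal sketch of constructing such a two-channel relative graph (relative coordinate map) for a given center, matching the example above (any normalisation or scaling of the coordinates is not specified by the embodiment and is omitted here):

import torch

def relative_coord_map(height, width, cx, cy):
    # Channel 0: relative abscissa (x - cx); channel 1: relative ordinate (y - cy).
    ys, xs = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
    return torch.stack([xs - cx, ys - cy]).float()  # shape (2, H, W)

# For a center at position (4, 5): rc[:, 5, 4] == (0, 0), rc[:, 5, 3] == (-1, 0), rc[:, 6, 4] == (0, 1)
rc = relative_coord_map(16, 16, 4, 5)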
During training, the invention uses Gaussian heat maps with peaks at the peripheral points as supervision, denoted T_i, with four channels; each channel T_i^j is the heat map true value corresponding to one of the peripheral points. The value of T_i^j may be calculated by:

T_i^j(x, y) = exp( -((x - x̂_i^j)² + (y - ŷ_i^j)²) / (2σ_i²) )

where j ∈ {1, 2, 3, 4} is the channel index, corresponding to one of the four peripheral points; x and y are the horizontal and vertical coordinates, respectively; and (x̂_i^j, ŷ_i^j) is the true position of the peripheral point. σ_i denotes the standard deviation of the Gaussian peak and is determined by the instance size, so a larger instance has a larger peak. σ_i is calculated from the instance height h_i and width w_i together with the hyper-parameter μ, which is used to adjust the peak size and defaults to 48.
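An illustrative sketch of generating one such Gaussian supervision heat map is given below; since the exact expression for σ_i is not reproduced here, the √(h·w)/μ form used in this sketch is an assumption consistent with the description (larger instances give larger peaks, μ defaults to 48):

import math
import torch

def gaussian_heatmap(height, width, px, py, inst_h, inst_w, mu=48.0):
    # Heat map peaking at the ground-truth peripheral point (px, py).
    sigma = math.sqrt(inst_h * inst_w) / mu  # assumed size-dependent standard deviation
    ys, xs = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
    d2 = (xs - px) ** 2 + (ys - py) ** 2
    return torch.exp(-d2.float() / (2.0 * sigma ** 2))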
It will be appreciated that, for this prediction to be accurate, a large receptive field is required, because determining whether a point is a peripheral point requires knowledge of its location relative to the entire instance. However, a large dynamic convolution kernel requires dynamically generating more parameters (e.g., a 3×3 convolution filter has 9 times as many parameters as a 1×1 convolution filter), resulting in a high computational cost. Therefore, a combination of two 1×1 dynamic convolutional layers and two 3×3 non-dynamic dilated convolutional layers is selected: the dynamic convolutions are used to determine the target instance, and the non-dynamic dilated convolutions are used to locate its peripheral points; since both dynamic convolutions are 1×1, only a few parameters are needed.
After locating the center point and the peripheral points, the next step is to select features on these keypoints. These selected features are then used to generate a dynamic convolution kernel for segmentation.
S102-7: based on the multi-level features, the reserved instance center and peripheral point diagrams, feature extraction and differential feature fusion are carried out, and a dynamic convolution kernel for segmentation is generated, which specifically comprises the following steps:
For the center point, the feature at the corresponding position is selected from the output value DS of the dynamic parameter generating head for segmentation, expressed as ds_i^c ∈ R^(C_DS).
For the peripheral points, since each peripheral point corresponds to one peripheral point map, a weighted average is calculated over the output value DS of the dynamic parameter generating head for segmentation, using the peripheral point map response as the weight, to generate the peripheral features:

ds_i^(p,j) = Σ_(x,y) P̃_i^j(x, y) · DS(x, y) / Σ_(x,y) P̃_i^j(x, y),   j ∈ {1, 2, 3, 4}

where P̃_i^j is generated from the peripheral point map P_i^j by spatial downsampling and channel broadcasting, so as to be aligned with the dimensions of DS.
With the center feature ds_i^c and the four peripheral features ds_i^(p,1), …, ds_i^(p,4) in hand, the next goal is to adaptively integrate these features to maximize the ability of the generated dynamic convolution kernel to capture the various feature patterns of an instance. To achieve this objective, the invention incorporates the DPF module, as shown in Fig. 11.
In the DPF module, the weight of each peripheral feature depends on its difference from the central feature; after weighted averaging, the combined peripheral feature is averaged with the central feature and then shaped by a reshape operation to obtain three dynamic convolution kernels.
The difference vector between the center feature ds_i^c and the peripheral feature ds_i^(p,j) is defined as follows:

d_i^j = Θ(ds_i^c) - Φ(ds_i^(p,j)),   j ∈ {1, 2, 3, 4}

where Θ and Φ represent two linear projection layers.
Next, the difference vectors are converted into a set of weights:

W_i = softmax( Cat( Ψ(d_i^1), Ψ(d_i^2), Ψ(d_i^3), Ψ(d_i^4) ) )

where softmax is the softmax function along the first dimension, Ψ is a linear projection layer, and Cat denotes stacking along the first dimension; all linear projections denoted by the same symbol share weights. W_i reflects the difference between the values of the central feature and each peripheral feature at their respective positions.
Multiplying the peripheral features element-wise by these weights independently extracts the components that differ from the central feature; thus, a fused peripheral feature is obtained by weighted averaging of the four peripheral features using W_i:

ds_i^p = Σ_(j=1..4) W_i^j ⊙ ds_i^(p,j)

where W_i^j represents the j-th row of W_i and ⊙ represents element-wise multiplication.
Finally, the dynamic convolution kernels G_i^s are obtained by taking the average of ds_i^c and ds_i^p and then shaping the resulting vector, expressed as:

G_i^s = Reshape( (ds_i^c + ds_i^p) / 2 )

where Reshape is the operation of shaping the vector into three 1×1 dynamic convolution kernels, ds_i^c is the central feature, and ds_i^p is the peripheral feature after differentiated feature fusion given by the preceding formula.
Illustratively, for the reshape operation, if an original feature has a size of 27×1, the arrangement of these 27 numbers can be changed into the shape of, for example, three 3×3 single-channel dynamic convolution kernels.
It should be appreciated that the peripheral features can complement a variety of features other than the central feature. Thus, those features that are different from the center feature are assigned higher weights. In an implementation, the weights of the combined peripheral features are calculated based on the differences between them and the central features.
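A compact PyTorch-style sketch of the differentiated feature fusion described above is given below; the projection dimensionality and the final averaging into a single vector are assumptions for illustration rather than the authoritative implementation:

import torch
import torch.nn as nn

class DPF(nn.Module):
    # Sketch: weight each peripheral feature by its difference from the center feature.
    def __init__(self, dim):
        super().__init__()
        self.theta = nn.Linear(dim, dim)  # projection of the center feature (Theta)
        self.phi = nn.Linear(dim, dim)    # projection of the peripheral features (Phi)
        self.psi = nn.Linear(dim, dim)    # projection of the difference vectors (Psi)

    def forward(self, center, peripherals):
        # center: (C_DS,); peripherals: (4, C_DS) features picked at the four peripheral points.
        diffs = self.theta(center).unsqueeze(0) - self.phi(peripherals)  # difference vectors d_i^j
        weights = torch.softmax(self.psi(diffs), dim=0)                  # softmax over the four points
        fused = (weights * peripherals).sum(dim=0)                       # fused peripheral feature
        return 0.5 * (center + fused)                                    # vector later reshaped into kernels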
Further, as shown in FIG. 9, the step S102-8: performing dynamic convolution operation on the bottom feature based on the dynamic convolution kernel for segmentation to obtain a segmentation mask, specifically including:
Firstly, the bottom feature and the relative coordinates with respect to the center point are concatenated along the channel dimension;
then, the three dynamic convolution kernels in G_i^s are used to sequentially perform convolution operations on the concatenated result;
finally, the output result is up-sampled to the size of the input picture.
The above process is formulated as:

M_i = Up( Conv_(G_i^(s,3))( Conv_(G_i^(s,2))( Conv_(G_i^(s,1))( Cat(B, RC) ) ) ) )

where M_i ∈ R^(H×W) represents the predicted mask, and H×W represents the spatial dimensions, which are the same as those of the input image; Cat denotes channel-wise concatenation; Conv_(G_i^(s,1)), Conv_(G_i^(s,2)) and Conv_(G_i^(s,3)) are the convolutions using the corresponding kernels of the dynamic convolution kernel group G_i^s; and Up denotes upsampling.
The process of segmentation using dynamic convolution is the same as in CondInst. Before the first dynamic convolution layer, the bottom feature B is concatenated with the relative coordinates RC with respect to the center, which provides location information to improve performance. The next two dynamic convolution layers take the output of the previous layer directly, without further concatenation.
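As a final non-authoritative sketch, the segmentation step can be written as three consecutive per-instance 1×1 dynamic convolutions applied to the concatenation of the bottom feature and the relative coordinate map, followed by upsampling; the intermediate activations (ReLU, sigmoid) are assumptions borrowed from common dynamic-convolution practice:

import torch
import torch.nn.functional as F

def dynamic_segmentation(bottom, rc, w1, w2, w3, out_size):
    # bottom: (1, C_b, H_b, W_b); rc: (1, 2, H_b, W_b); w1/w2/w3: per-instance 1x1 convolution weights.
    x = torch.cat([bottom, rc], dim=1)      # concatenate bottom feature B and relative coordinates RC
    x = F.relu(F.conv2d(x, w1))             # first dynamic 1x1 convolution
    x = F.relu(F.conv2d(x, w2))             # second dynamic 1x1 convolution
    x = torch.sigmoid(F.conv2d(x, w3))      # third dynamic 1x1 convolution -> single-channel mask
    return F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)  # upsample to image size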
Fig. 12 (a) shows an input image of the first embodiment, Fig. 12 (b) the corresponding ground truth, and Fig. 12 (c)-(f) the prediction results of S4Net, SCG, RDPNet and GCDCNet, respectively.
In the prior art, only the center feature of an instance is typically used to generate the dynamic convolution kernel, as the center feature is considered to represent the entire instance. However, this approach does not perform well when dealing with those instances that cover multiple different features, as the center feature can only capture one pattern. Thus, the present invention introduces the concept of "peripheral points," which are the leftmost, rightmost, uppermost, and lowermost points of an instance. These points are distributed over the example boundaries and cover modes that are different from the center feature and thus can be complementary to the center feature. At the same time, these points also provide the necessary geometric information, as they outline the general shape of the instance, ensuring the macroscopic accuracy of the segmentation result. Therefore, the KGDC mechanism of the present invention uses the central and peripheral features to generate the dynamic convolution kernel to complete the division of the instance.
Further, the trained geometric constraint dynamic convolution network comprises the following network structures:
the backbone network is used for inputting images to be segmented;
the plurality of output ends of the backbone network are connected with the plurality of input ends of the characteristic pyramid network in a one-to-one correspondence manner;
the feature pyramid network comprises a plurality of output ends, and each output end of the feature pyramid network is connected with a corresponding example perception head;
Each instance sense head includes: four parallel branches: the first branch is a classification head, the second branch is a boundary box pre-measurement head, the third branch is a dynamic parameter generation head for peripheral point positioning, and the fourth branch is a dynamic parameter generation head for segmentation;
the input ends of the four branches of each example perception head are used for inputting a feature map of a corresponding scale;
the output ends of the four branches of each example perception head are connected with the input end of the dynamic convolution module guided by key points;
the output end of the feature pyramid network is also connected with the input end of the bottom module, and the output end of the bottom module is connected with the input end of the dynamic convolution module guided by using key points;
the predicted mask is output using the output of the keypoint guided dynamic convolution module.
Further, the backbone network comprises a first feature extraction layer, a second feature extraction layer, a third feature extraction layer, a fourth feature extraction layer and a fifth feature extraction layer;
the feature pyramid network comprises a sixth feature extraction layer, a seventh feature extraction layer, an eighth feature extraction layer, a ninth feature extraction layer and a tenth feature extraction layer;
the output end of the third characteristic extraction layer is connected with the input end of the sixth characteristic extraction layer; the output end of the sixth characteristic extraction layer is connected with the input end of the first example perception head;
The output end of the fourth characteristic extraction layer is connected with the input end of the seventh characteristic extraction layer; the output end of the seventh characteristic extraction layer is connected with the input end of the second example perception head;
the output end of the fifth characteristic extraction layer is connected with the input end of the eighth characteristic extraction layer; the output end of the eighth feature extraction layer is connected with the input end of the third example perception head;
the output end of the eighth feature extraction layer is connected with the input end of the ninth feature extraction layer; the output end of the ninth feature extraction layer is connected with the input end of the fourth example perception head;
the output end of the ninth feature extraction layer is connected with the input end of the tenth feature extraction layer; an output of the tenth feature extraction layer is connected to an input of the fifth instance perception head.
Further, as shown in fig. 10, the network structure of the dynamic convolution module using the key point guidance includes:
a first input, a second input, and a third input;
the first input end is used for inputting a central point; the second input end is used for inputting an output value DP of the dynamic parameter generating head for peripheral point positioning, and the third input end is used for inputting an output value DS of the dynamic parameter generating head for segmentation;
the first input end and the second input end are connected with the input end of the parallel splicing unit, the output end of the parallel splicing unit is connected with the input end of the first dynamic convolution layer, the output end of the first dynamic convolution layer is connected with the input end of the second dynamic convolution layer, the output layer of the second dynamic convolution layer is connected with the first expansion convolution layer, the output end of the first expansion convolution layer is connected with the input end of the second expansion convolution layer, and the output end of the second expansion convolution layer outputs a peripheral dot diagram;
The output end and the third input end of the second expansion convolution layer are connected with the input end of the weighted average unit, and the output end of the weighted average unit generates peripheral characteristics;
the third input end is also connected with the input end of the selection unit, and the output end of the selection unit is used for outputting the characteristics at the selected corresponding position;
the output end of the weighted average unit and the output end of the selection unit are connected with the input end of the differential feature fusion module, and the output end of the differential feature fusion module outputs a feature fusion result.
Further, as shown in fig. 11, the differentiated feature fusion DPF (Differentiated Patterns Fusion) module includes:
a first difference calculation unit, a second difference calculation unit, a third difference calculation unit, and a fourth difference calculation unit; each difference calculating unit is used for calculating the distance between the central characteristic and the corresponding peripheral characteristic to obtain a difference vector;
the difference vectors are input to a parallel splicing unit, the output end of the parallel splicing unit is connected with the input end of the softmax function unit, and the output end of the softmax function unit outputs a weight value;
and carrying out weighted average on the weight value and the peripheral feature to obtain the peripheral feature after the differentiation feature fusion.
The invention provides an instance segmentation method based on geometric constraint dynamic convolution, named the geometric constraint dynamic convolution network (Geometric Constraint-based Dynamic Convolution Network, GCDCNet). It uses keypoints-guided dynamic convolution (Keypoints-guided Dynamic Convolution, KGDC) to further integrate geometric structure information into an RoI-free dynamic convolution framework, generating a more comprehensive dynamic convolution kernel that can recognize the diversified patterns of an instance. Specifically, it uses a combination of the center point and additional peripheral points. The peripheral points are the leftmost, rightmost, uppermost and lowermost points of an instance. From a spatial-geometry perspective, these points are typically farther from the center in different directions, which makes them more likely to capture differentiated features. Furthermore, these four peripheral points can delineate the approximate coverage of the instance at a relatively low computational cost, and can be regarded as a complement to the center-based approach. Incorporating the features at these four peripheral points into the dynamic convolution kernel can generate a more macroscopically accurate mask. In addition, since the peripheral points are representative points on the boundary, their features can implicitly warn the convolution kernel not to produce too high a response outside the boundary, helping to accurately delineate the instance edges.
First, the backbone network and the FPN perform multi-level feature extraction, followed by several instance perception heads connected to the FPN layers and an instance-independent stream composed of the bottom module. The head connected to each FPN layer includes a classification head for locating centers, a bounding box prediction head, and two dynamic parameter generating heads for generating the parameters that make up the dynamic convolution kernels. Meanwhile, using the E_3 to E_5 FPN layers as input, the bottom module generates the bottom features. The KGDC mechanism then uses the bottom features and the outputs of all the heads to predict the peripheral points, and uses the features at these peripheral points as well as the feature at the center point to form the convolution kernels for instance segmentation.
The invention realizes a dynamic convolution network (GCDCNet) with geometric constraint to solve the example segmentation task, and is mainly characterized in that a dynamic convolution (KGDC) mechanism guided by key points designed by the invention is used, and the high-quality overall segmentation of the example is ensured by using dynamic convolution kernels with more abundant geometric constraint and feature capacity:
the GCDCNet provided by the invention is mainly characterized by applying a key point guided dynamic convolution (KGDC) mechanism. The method integrates the characteristics of a plurality of key points to generate a dynamic convolution kernel which can comprehensively reflect the characteristics of the instance and implies geometric constraints, thereby completely and accurately dividing the instance.
The invention proposes a dynamic convolution network (GCDCNet) with geometric constraint to solve the example segmentation task. It uses a key point guided dynamic convolution (KGDC) mechanism, exploiting the features of the center and peripheral points to create a dynamic convolution kernel for the split instance. This approach allows the convolution kernel to enforce geometric constraints and build a comprehensive example paradigm. In order to maximize the information diversity accommodated in the dynamic convolution kernel, a differentiated feature fusion (DPF) module is employed to adaptively extract complementary components from the five key point features. In experiments, the present invention demonstrates the superiority of the KGDC mechanism, DPF module, and model overall architecture, with performance exceeding all previous most advanced models.
Example two
The embodiment provides an example segmentation system based on geometric constraint dynamic convolution;
an example segmentation system based on geometric constraint dynamic convolution, comprising:
an acquisition module configured to: acquiring an image to be segmented;
a prediction module configured to: inputting the image to be segmented into a trained geometric constraint dynamic convolution network to obtain a segmentation mask and a category confidence corresponding to an image instance in the image to be segmented;
A segmentation module configured to: determining an image instance segmentation result corresponding to the image to be segmented based on a segmentation mask corresponding to the image instance and a category confidence coefficient in the image to be segmented;
wherein the trained geometric constraint dynamic convolution network is used for: carrying out multi-level feature extraction on the image to be segmented to obtain multi-level features; performing instance perception on the multi-level features to obtain the centers, corresponding category confidences and bounding boxes of all predicted instances; performing non-maximum suppression (Non-Maximum Suppression, NMS) based on the bounding boxes to obtain the reserved instance centers; extracting bottom features from the multi-level features to obtain the bottom features; performing center feature extraction based on the multi-level features and the reserved instance centers to generate a dynamic convolution kernel for peripheral point location; performing a dynamic convolution operation on the bottom features using the dynamic convolution kernel for peripheral point location, performing peripheral point map prediction, and generating a peripheral point map; performing feature extraction and differentiated feature fusion based on the multi-level features, the reserved instance centers and the peripheral point map, and generating a dynamic convolution kernel for segmentation; and performing a dynamic convolution operation on the bottom features using the dynamic convolution kernel for segmentation to obtain a segmentation mask.
Here, the acquiring module, the predicting module, and the dividing module correspond to steps S101 to S103 in the first embodiment, and the modules are the same as examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the first embodiment. It should be noted that the modules described above may be implemented as part of a system in a computer system, such as a set of computer-executable instructions.
The foregoing embodiments are directed to various embodiments, and details of one embodiment may be found in the related description of another embodiment.
The proposed system may be implemented in other ways. The system embodiments described above are merely illustrative; for example, the division into the modules described above is merely a division by logical function, and other divisions are possible in actual implementation: multiple modules may be combined or integrated into another system, or some features may be omitted or not performed.
Embodiment three
This embodiment provides an electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein the processor is coupled to the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory to cause the electronic device to perform the method of the first embodiment.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include read-only memory and random access memory and provides instructions and data to the processor; a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information about the device type.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software.
The method in the first embodiment may be directly implemented by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, a detailed description is not provided here.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Embodiment four
This embodiment provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the method of the first embodiment.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. An instance segmentation method based on geometric constraint dynamic convolution, characterized by comprising the following steps:
Acquiring an image to be segmented;
inputting the image to be segmented into a trained geometric constraint dynamic convolution network to obtain a segmentation mask and a category confidence corresponding to an image instance in the image to be segmented;
determining an instance segmentation result for the image to be segmented based on the segmentation mask and the category confidence corresponding to each image instance;
wherein the trained geometric constraint dynamic convolution network is used for: performing multi-level feature extraction on the image to be segmented to obtain multi-level features; performing instance perception on the multi-level features to obtain the centers, corresponding category confidences and bounding boxes of all predicted instances; performing non-maximum suppression based on the bounding boxes to obtain the retained instance centers; performing bottom feature extraction on the multi-level features to obtain bottom features; performing center feature extraction based on the multi-level features and each retained instance center to generate a dynamic convolution kernel for peripheral point positioning; performing a dynamic convolution operation on the bottom features based on the dynamic convolution kernel for peripheral point positioning to predict and generate a peripheral point map; performing feature extraction and differentiated feature fusion based on the multi-level features, the retained instance centers and the peripheral point maps to generate a dynamic convolution kernel for segmentation; and performing a dynamic convolution operation on the bottom features based on the dynamic convolution kernel for segmentation to obtain the segmentation mask.
2. The instance segmentation method based on geometric constraint dynamic convolution according to claim 1, wherein performing multi-level feature extraction on the image to be segmented to obtain the multi-level features specifically comprises:
multi-level feature extraction is carried out on the image to be segmented by adopting a backbone network to obtain feature F1, feature F2, feature F3, feature F4 and feature F5; and feature F3, feature F4 and feature F5 are input into a feature pyramid network for further feature extraction to obtain feature E3, feature E4, feature E5, feature E6 and feature E7.
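As a concrete illustration of this step, the sketch below assembles a comparable five-level feature pyramid with torchvision; the choice of ResNet-50 and the derivation of E6 and E7 from E5 by strided convolutions are assumptions for illustration, since the claim does not fix the backbone.

import torch
import torchvision
from torchvision.models._utils import IntermediateLayerGetter
from torchvision.ops import FeaturePyramidNetwork

resnet = torchvision.models.resnet50(weights=None)
# F3, F4, F5 are taken from the last three stages of the backbone
body = IntermediateLayerGetter(resnet, return_layers={"layer2": "F3", "layer3": "F4", "layer4": "F5"})

# feature pyramid over F3..F5, plus two stride-2 convolutions producing E6 and E7
fpn = FeaturePyramidNetwork(in_channels_list=[512, 1024, 2048], out_channels=256)
extra6 = torch.nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)
extra7 = torch.nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 3, 512, 512)
feats = body(x)                       # {"F3": ..., "F4": ..., "F5": ...}
pyramid = fpn(feats)                  # {"F3": E3, "F4": E4, "F5": E5}
e3, e4, e5 = pyramid["F3"], pyramid["F4"], pyramid["F5"]
e6 = extra6(e5)
e7 = extra7(torch.relu(e6))           # RetinaNet/FCOS-style extra levels (an assumption)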
3. The instance segmentation method based on geometric constraint dynamic convolution according to claim 1, wherein performing instance perception on the multi-level features to obtain the centers, corresponding category confidences and bounding boxes of all predicted instances specifically comprises:
the features E3-E7 are respectively input into the classification head networks in the corresponding instance perception head networks, and the classification head networks output all predicted instance centers and their category confidences; meanwhile, the features E3-E7 are respectively input into the bounding box prediction head networks in the corresponding instance perception head networks, which predict the bounding boxes corresponding to all predicted instance centers.
4. The instance segmentation method based on geometric constraint dynamic convolution according to claim 3, wherein the instance perception head network internally comprises:
four parallel branches: the first branch is a classification head, the second branch is a bounding box prediction head, the third branch is a dynamic parameter generating head for peripheral point positioning, and the fourth branch is a dynamic parameter generating head for segmentation;
the working process of the classification head is as follows: taking the corresponding level features as input, a center point map is generated through four 3×3 convolution layers and one 1×1 convolution layer; in the center point map, the values at all predicted instance center positions are 1, and the values at other positions are 0;
the working process of the bounding box prediction head is as follows: taking the corresponding level features as input, a bounding box corresponding to each position is generated through four 3×3 convolution layers and one 1×1 convolution layer; the bounding box is expressed as the distances between its four sides and its center point;
the working process of the dynamic parameter generating head for peripheral point positioning is as follows: taking the corresponding level features as input, dynamic parameters for peripheral point positioning are generated through four 3×3 convolution layers and one 1×1 convolution layer; the dynamic parameters for peripheral point positioning are used for constructing the dynamic convolution kernel for peripheral point positioning;
the working process of the dynamic parameter generating head for segmentation is as follows: taking the corresponding level features as input, dynamic parameters for segmentation are generated through four 3×3 convolution layers and one 1×1 convolution layer; the dynamic parameters for segmentation are used for constructing the dynamic convolution kernel for segmentation.
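A minimal sketch of the four-branch instance perception head described in claim 4; the number of categories, the numbers of dynamic parameters (96 and 169), and the interpretation of the center point map as a per-class heatmap are assumptions chosen only to make the example self-contained.

import torch
import torch.nn as nn

def tower(in_ch, out_ch, mid_ch=256):
    # four 3x3 convolution layers followed by one 1x1 convolution layer, as in claim 4
    layers = []
    ch = in_ch
    for _ in range(4):
        layers += [nn.Conv2d(ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True)]
        ch = mid_ch
    layers.append(nn.Conv2d(mid_ch, out_ch, 1))
    return nn.Sequential(*layers)

class InstancePerceptionHead(nn.Module):
    def __init__(self, in_ch=256, num_classes=80, num_dp=96, num_ds=169):
        super().__init__()
        self.cls_head = tower(in_ch, num_classes)  # center point map (per-class heatmap assumed)
        self.box_head = tower(in_ch, 4)            # distances to the four sides of the box
        self.dp_head = tower(in_ch, num_dp)        # dynamic parameters for peripheral point positioning
        self.ds_head = tower(in_ch, num_ds)        # dynamic parameters for segmentation

    def forward(self, feat):
        return (self.cls_head(feat), self.box_head(feat),
                self.dp_head(feat), self.ds_head(feat))

head = InstancePerceptionHead()
cls_map, box_map, dp, ds = head(torch.randn(1, 256, 64, 64))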
5. The instance segmentation method based on geometric constraint dynamic convolution according to claim 1, wherein performing center feature extraction based on the multi-level features and each retained instance center to generate a dynamic convolution kernel for peripheral point positioning specifically comprises:
picking out the features at the center points from the output DP of the dynamic parameter generating head for peripheral point positioning to generate a first group of dynamic convolution kernels, and then convolving the bottom feature B with the first group of dynamic convolution kernels to predict a peripheral point map;
specifically, for each of the N instance centers $\{c_i\}$, $i \in \{1, 2, \dots, N\}$, predicted by the classification head, the values on all channels of the output DP of the dynamic parameter generating head for peripheral point positioning located at the same position are taken out, expressed as $DP_{c_i}$, and these features are shaped by a reshape operation into a group of dynamic convolution kernels $\theta_i^{per}$; this group of dynamic convolution kernels will be used to predict the peripheral point map.
6. The instance segmentation method based on geometric constraint dynamic convolution according to claim 1, wherein performing a dynamic convolution operation on the bottom features based on the dynamic convolution kernel for peripheral point positioning to predict and generate a peripheral point map specifically comprises:
the peripheral point map $P_i$ of predicted instance i is generated as follows:
$P_i = \mathcal{F}_{i,2}^{dyn}\big(\mathcal{F}_{i,1}^{dyn}\big(\mathcal{D}_{3\times 3}^{2}(\mathrm{Cat}(B, RC_i))\big)\big)$
wherein B represents the bottom feature; $RC_i$ is a relative coordinate map with respect to the center position, which further indicates the target instance; $\mathcal{D}_{3\times 3}^{2}$ is a 3×3 convolution layer with a dilation rate of 2; $\mathcal{F}_{i,1}^{dyn}$ and $\mathcal{F}_{i,2}^{dyn}$ are 1×1 dynamic convolutions using the dynamic convolution kernel group $\theta_i^{per}$; Cat represents concatenation; each channel of $P_i$ corresponds to the prediction map of one peripheral point, expressed as $P_i^{j}$, $j \in \{1, 2, 3, 4\}$.
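For illustration, a minimal PyTorch-style sketch of claims 5 and 6 together: gathering the DP channel vector at a retained center into a per-instance group of 1×1 dynamic kernels, and applying them to the bottom feature concatenated with the relative coordinate map. The channel sizes, the split of the DP vector, and the placement of the dilated convolution before the two dynamic convolutions follow the reconstruction above and are assumptions.

import torch
import torch.nn.functional as F

def kernels_from_centers(dp, centers, in_ch=8, mid_ch=8, num_points=4):
    # dp: (C, H, W) dynamic-parameter map DP; centers: list of (y, x) retained instance centers.
    # The channel vector of DP at each center is split and reshaped into the two
    # 1x1 dynamic kernels used below (split sizes are assumptions; C must be >= 96 here).
    kernels = []
    for (y, x) in centers:
        vec = dp[:, y, x]
        n1 = in_ch * mid_ch
        w1 = vec[:n1].reshape(mid_ch, in_ch, 1, 1)
        w2 = vec[n1:n1 + mid_ch * num_points].reshape(num_points, mid_ch, 1, 1)
        kernels.append((w1, w2))
    return kernels

def relative_coord_map(h, w, center):
    # RC: normalised offsets of every pixel from the instance center, shape (2, H, W)
    cy, cx = center
    ys = torch.arange(h, dtype=torch.float32).view(h, 1).expand(h, w)
    xs = torch.arange(w, dtype=torch.float32).view(1, w).expand(h, w)
    return torch.stack([(ys - cy) / h, (xs - cx) / w])

def predict_peripheral_maps(bottom, center, w1, w2, dilated_conv):
    # bottom: (Cb, H, W) bottom feature B; (w1, w2): one kernel group from kernels_from_centers;
    # dilated_conv: a shared static 3x3 convolution with dilation 2, e.g.
    #   torch.nn.Conv2d(Cb + 2, 8, kernel_size=3, padding=2, dilation=2)
    _, h, w = bottom.shape
    rc = relative_coord_map(h, w, center)
    x = torch.cat([bottom, rc], dim=0).unsqueeze(0)   # Cat(B, RC_i)
    x = dilated_conv(x)                               # 3x3 convolution with dilation 2
    x = F.conv2d(x, w1)                               # first dynamic 1x1 convolution
    p = F.conv2d(x, w2)                               # second dynamic 1x1 convolution
    return p[0]                                       # (4, H, W): one map per peripheral point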
7. The instance segmentation method based on geometric constraint dynamic convolution according to claim 1, wherein performing feature extraction and differentiated feature fusion based on the multi-level features, the retained instance centers and the peripheral point maps to generate a dynamic convolution kernel for segmentation specifically comprises:
for the center point, the feature at the corresponding position is picked out from the output DS of the dynamic parameter generating head for segmentation, expressed as $f_i^{ctr} = DS_{c_i}$;
for the peripheral points, since each peripheral point corresponds to one peripheral point map, a weighted average over the output DS of the dynamic parameter generating head for segmentation is computed using the peripheral point map response as the weight to generate the peripheral feature:
$f_i^{per,j} = \frac{\sum_{x} \tilde{P}_i^{j}(x)\, DS(x)}{\sum_{x} \tilde{P}_i^{j}(x)}$
wherein $\tilde{P}_i^{j}$ is generated from $P_i^{j}$ by spatial downsampling and channel broadcasting, so that it is aligned with the dimensions of DS;
the feature weight of each peripheral point depends on its difference from the center feature; after the weighted averaging, the fused peripheral feature is averaged with the center feature and then shaped by a reshape operation to obtain three dynamic convolution kernels;
the difference vector between the center feature $f_i^{ctr}$ and the peripheral feature $f_i^{per,j}$ is defined as follows:
$d_i^{j} = \Theta(f_i^{ctr}) - \Phi(f_i^{per,j})$
wherein Θ and Φ represent two linear projection layers;
next, the difference vectors are converted into a set of weights:
$W_i = \mathrm{softmax}\big(\big[\Psi(d_i^{1});\, \Psi(d_i^{2});\, \Psi(d_i^{3});\, \Psi(d_i^{4})\big]\big)$
where softmax is a softmax function along the first dimension and Ψ is a linear projection layer; all linear projections with the same symbol share weights; $W_i$ reflects, position by position, the difference between the values of the center feature and the peripheral features;
multiplying the peripheral features element-wise by these weights isolates the components that differ from the center feature; thus, a fused peripheral feature is obtained by weighted averaging of the four peripheral features with $W_i$:
$\hat{f}_i^{per} = \sum_{j=1}^{4} W_i^{j} \odot f_i^{per,j} \quad (7)$
wherein $W_i^{j}$ represents the j-th row of $W_i$, and $\odot$ represents element-wise multiplication;
finally, a dynamic convolution kernelBy taking->And->Is then obtained by shaping the generated vector, expressed as:
where Resh is the operation of shaping into three 1 x 1 dynamic convolution kernels,is a central feature +.>Is the peripheral feature after the differentiation feature fusion, and is given by the formula (7).
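A minimal sketch of the differentiated feature fusion described in claim 7, following the reconstructed formulas above; the projection dimensions and the number of dynamic parameters (169) are assumptions for illustration.

import torch
import torch.nn as nn

class DifferentiatedFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.theta = nn.Linear(dim, dim)   # Θ: projects the center feature
        self.phi = nn.Linear(dim, dim)     # Φ: projects each peripheral feature
        self.psi = nn.Linear(dim, dim)     # Ψ: maps difference vectors to weights

    def forward(self, f_ctr, f_per):
        # f_ctr: (C,) center feature; f_per: (4, C) the four peripheral features
        d = self.theta(f_ctr).unsqueeze(0) - self.phi(f_per)   # difference vectors d_i^j
        w = torch.softmax(self.psi(d), dim=0)                  # softmax along the first dimension (over the 4 points)
        fused = (w * f_per).sum(dim=0)                         # element-wise weighted average of the peripheral features
        return 0.5 * (f_ctr + fused)                           # averaged with the center feature

# usage: the returned vector would then be shaped (Resh) into three 1x1 dynamic convolution kernels
fuse = DifferentiatedFusion(dim=169)
kernel_vec = fuse(torch.randn(169), torch.randn(4, 169))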
8. An instance segmentation system based on geometric constraint dynamic convolution, characterized by comprising:
an acquisition module configured to: acquire an image to be segmented;
a prediction module configured to: input the image to be segmented into a trained geometric constraint dynamic convolution network to obtain a segmentation mask and a category confidence corresponding to each image instance in the image to be segmented;
a result output module configured to: determine an instance segmentation result for the image to be segmented based on the segmentation mask and the category confidence corresponding to each image instance;
wherein the trained geometric constraint dynamic convolution network is used for: performing multi-level feature extraction on the image to be segmented to obtain multi-level features; performing instance perception on the multi-level features to obtain the centers, corresponding category confidences and bounding boxes of all predicted instances; performing non-maximum suppression based on the bounding boxes to obtain the retained instance centers; performing bottom feature extraction on the multi-level features to obtain bottom features; performing center feature extraction based on the multi-level features and each retained instance center to generate a dynamic convolution kernel for peripheral point positioning; performing a dynamic convolution operation on the bottom features based on the dynamic convolution kernel for peripheral point positioning to predict and generate a peripheral point map; performing feature extraction and differentiated feature fusion based on the multi-level features, the retained instance centers and the peripheral point maps to generate a dynamic convolution kernel for segmentation; and performing a dynamic convolution operation on the bottom features based on the dynamic convolution kernel for segmentation to obtain the segmentation mask.
9. An electronic device, comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer-readable instructions,
wherein the computer-readable instructions, when executed by the processor, perform the method of any one of claims 1-7.
10. A storage medium, characterized by non-transitorily storing computer-readable instructions, wherein the method of any one of claims 1-7 is performed when the computer-readable instructions are executed by a computer.
CN202311345265.0A 2023-10-17 2023-10-17 Instance segmentation method and system based on geometric constraint dynamic convolution Pending CN117373020A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311345265.0A CN117373020A (en) 2023-10-17 2023-10-17 Instance segmentation method and system based on geometric constraint dynamic convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311345265.0A CN117373020A (en) 2023-10-17 2023-10-17 Instance segmentation method and system based on geometric constraint dynamic convolution

Publications (1)

Publication Number Publication Date
CN117373020A true CN117373020A (en) 2024-01-09

Family

ID=89390534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311345265.0A Pending CN117373020A (en) 2023-10-17 2023-10-17 Instance segmentation method and system based on geometric constraint dynamic convolution

Country Status (1)

Country Link
CN (1) CN117373020A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination