CN117314938B - Image segmentation method and device based on multi-scale feature fusion decoding - Google Patents
- Publication number: CN117314938B (application CN202311529949.6A)
- Authority: CN (China)
- Prior art keywords: scale, tensor, feature map, fusion, embedded
- Legal status: Active (assumed; not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/52—Scale-space analysis, e.g. wavelet analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
An image segmentation method and device based on multi-scale feature fusion decoding are provided in an embodiment of the disclosure. The method comprises: acquiring a multi-scale feature map of an original image; upsampling the multi-scale feature map to obtain upsampled feature maps, fusing the multi-scale feature map with the upsampled feature maps to obtain multi-scale fused feature maps, and sequentially encoding the multi-scale fused feature maps to generate multi-scale embedded tensors; decoding the multi-scale embedded tensors to obtain multi-scale mask tensors and multi-scale key corner tensors; re-decoding the multi-scale embedded tensors to obtain multi-scale contour tensors; and splicing the multi-scale mask tensors, multi-scale key corner tensors and multi-scale contour tensors into a multi-scale fusion query, which is encoded to obtain the final image segmentation result. The method analyzes masks and contours of global features together with key corner points of local features, performs multi-scale feature fusion decoding, and improves image instance segmentation accuracy.
Description
Technical Field
The embodiment of the disclosure relates to the technical field of computer vision, in particular to an image segmentation method, an image segmentation device, computer equipment and a computer readable storage medium based on multi-scale feature fusion decoding.
Background
Image instance segmentation is an important task in the field of computer vision, aimed at separating and labeling the different object instances in an image. The technology has broad application prospects in fields such as autonomous driving, medical image processing and video surveillance. Conventional image instance segmentation methods typically use manually designed features and classifiers, which are of limited effectiveness on complex instance segmentation problems. In recent years, the development of deep learning techniques has driven rapid progress in the field of image instance segmentation. Deep learning models such as convolutional neural networks (CNNs) can extract high-level features from images, making the instance segmentation task more accurate and robust. However, the instance segmentation task remains challenging because object instances in an image vary in size, shape and complexity. In existing deep learning models, feature extraction at a single scale cannot capture all the details and features of an object instance, and their segmentation accuracy needs further improvement.
Disclosure of Invention
An object of an embodiment of the present disclosure is to provide an image segmentation method, apparatus, computer device and computer readable storage medium based on multi-scale feature fusion decoding, so as to solve the foregoing problems in the prior art.
In order to achieve the above objective, the technical solution adopted in the embodiments of the present disclosure is as follows:
an aspect of an embodiment of the present disclosure provides an image segmentation method based on multi-scale feature fusion decoding, the method including:
acquiring a multi-scale feature map of an image to be segmented;
performing up-sampling on the minimum-scale feature map in the multi-scale feature map for multiple times to obtain a multi-scale up-sampling feature map, fusing the multi-scale feature map with the up-sampling feature map of the corresponding scale to obtain a multi-scale fused feature map, and sequentially encoding the multi-scale fused feature map to generate a multi-scale embedded tensor;
decoding the multi-scale embedded tensor according to the learnable query quantity to obtain a multi-scale mask tensor and a multi-scale key corner tensor;
re-decoding the multi-scale embedded tensor by taking the multi-scale mask tensor as a query quantity to obtain a multi-scale contour tensor;
and splicing the multi-scale mask tensor, the multi-scale key corner tensor and the multi-scale contour tensor into a multi-scale fusion query volume, and encoding the multi-scale fusion query volume to obtain a final image segmentation result.
Illustratively, the acquiring the multi-scale feature map of the image to be segmented includes:
acquiring an original image to be segmented;
and performing convolution calculation on the original image in sequence and downsampling with a maximum pooling method to obtain the multi-scale feature map.
Illustratively, the performing up-sampling on the minimum-scale feature map of the multi-scale feature map multiple times to obtain a multi-scale up-sampling feature map, and fusing the multi-scale feature map with the up-sampling feature map of the corresponding scale to obtain a multi-scale fused feature map, includes:
continuously upsampling the minimum scale feature map of the multi-scale feature map for multiple times to obtain multi-scale upsampled feature maps, the number of which is the same as that of the multi-scale feature maps;
and respectively superposing the multi-scale feature map and the up-sampling feature map with corresponding scales, and performing convolution smoothing on the superposed multi-scale feature map to obtain a multi-scale fusion feature map.
Illustratively, the encoding the multi-scale fusion feature map sequentially generates a multi-scale embedded tensor, including:
respectively carrying out self-attention calculation on the multi-scale fusion feature images to obtain corresponding initial embedded tensors;
and respectively carrying out two linear transformations on the multi-scale initial embedded tensor, and carrying out nonlinear ReLU activation in the middle of the two linear transformations to generate a final multi-scale embedded tensor.
Illustratively, decoding the multi-scale embedded tensor according to the learnable query volume to obtain a multi-scale mask tensor and a multi-scale key corner tensor, including:
performing self-attention computation and nonlinear transformation on the multiscale embedded tensor respectively to obtain corresponding first output, wherein the query quantity, key and value of the self-attention computation are all corresponding embedded tensors;
performing cross attention calculation and nonlinear transformation on the first output to obtain a corresponding second output; the query quantity in the cross attention calculation is a parameter quantity which can be learned, and the key and the value are the first output for carrying out self attention calculation corresponding to the multi-scale embedded tensor;
and respectively carrying out dot product operation on the second output and the fusion feature map with the largest scale to obtain a multi-scale mask tensor and a multi-scale key corner tensor.
Illustratively, the re-decoding the multi-scale embedded tensor by using the multi-scale mask tensor as a query quantity to obtain a multi-scale contour tensor includes:
taking each multi-scale mask tensor as a query, with the keys and values being the corresponding multi-scale embedded tensors; performing cross-attention calculation on the multi-scale embedded tensors; performing nonlinear transformation on the cross-attention output; and performing a dot product operation between the nonlinear transformation result and the largest-scale fused feature map to obtain the multi-scale contour tensor.
Illustratively, the stitching the multi-scale mask tensor, the multi-scale key corner tensor, and the multi-scale contour tensor into a multi-scale fusion query volume, and encoding the multi-scale fusion query volume to obtain a final image segmentation result, includes:
respectively splicing the multi-scale mask tensor, the multi-scale key corner tensor and the multi-scale outline tensor to obtain a multi-scale fusion query quantity;
and performing self-attention calculation and nonlinear transformation on the multi-scale fusion query volume, performing dot product operation with the fusion feature map with the largest scale to obtain segmentation results with different scales, and accumulating the segmentation results with different scales to obtain a final image instance segmentation result.
Another aspect of the disclosed embodiments provides an image segmentation apparatus based on multi-scale feature fusion decoding, the apparatus comprising:
the feature extraction network is used for acquiring images and extracting a multi-scale feature map;
the encoder is used for carrying out up-sampling on the multi-scale feature map for multiple times to obtain a corresponding up-sampling feature map, fusing the multi-scale feature map with the up-sampling feature map of a corresponding scale, and sequentially encoding the fused multi-scale feature map to generate a multi-scale embedded tensor;
the multi-scale feature decoder is used for decoding the multi-scale embedded tensor according to the learnable query quantity to obtain a multi-scale mask tensor and a multi-scale key corner tensor;
the contour decoder is used for re-decoding the multi-scale embedded tensor by taking the multi-scale mask tensor as a query quantity to obtain a multi-scale contour tensor;
the fusion decoder is used for splicing the multi-scale mask tensor, the multi-scale key corner tensor and the multi-scale contour tensor into a multi-scale fusion query, and encoding the multi-scale fusion query to obtain a final image segmentation result.
Another aspect of the disclosed embodiments provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the program.
Another aspect of the disclosed embodiments provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements a method as described above.
The beneficial effects of the embodiment of the disclosure are that:
according to the image instance segmentation method based on multi-scale feature fusion decoding, global features such as masks, outlines and key corner points of local features are analyzed, multi-scale feature fusion decoding is carried out, and image instance segmentation accuracy is improved. The method disclosed by the invention is simple and convenient to operate and good in segmentation effect.
Drawings
FIG. 1 is a schematic flow diagram of an image segmentation method based on multi-scale feature fusion decoding according to an embodiment of the disclosure;
FIG. 2 is a schematic structural diagram of an image segmentation apparatus based on multi-scale feature fusion decoding according to an embodiment of the disclosure;
fig. 3 is a workflow diagram of an image segmentation apparatus based on multi-scale feature fusion decoding in accordance with an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the embodiments of the present disclosure will be further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description is merely illustrative of the disclosed embodiments and is not intended to limit the disclosed embodiments.
As shown in fig. 1, an embodiment of the present disclosure proposes an image segmentation method based on multi-scale feature fusion decoding, where the method includes:
and S1, acquiring a multi-scale feature map of the image to be segmented.
As an example, the acquiring a multi-scale feature map of an image to be segmented includes:
step S11, obtaining an original image to be segmented.
Step S12, performing convolution calculation on the original image in sequence and downsampling with a maximum pooling method to obtain the multi-scale feature map, according to the formula:

D_{n+1} = f_downsample(f_Conv(D_n)), n = 0, 1, 2, 3

where f_downsample() denotes the downsampling operation, f_Conv() denotes the convolution operation, D_0 is the original image, and D_1 ~ D_4 are the feature maps of progressively decreasing size obtained from four consecutive downsampling steps.
The maximum pooling method in step S12 aggregates information from local regions of the input feature map by downsampling them, which reduces the number of parameters, lowers computational complexity, cuts the computational cost of training and inference, reduces the risk of overfitting, and retains the key information of the input feature map so that important features can be better identified and learned.
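The conv-then-pool pipeline of steps S11 and S12 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: the convolution f_Conv is stubbed out as the identity, the channel dimension is dropped, and `max_pool_2x2` and `build_pyramid` are hypothetical helper names (a real implementation would use e.g. torch.nn.Conv2d and torch.nn.MaxPool2d).

```python
import numpy as np

def max_pool_2x2(x):
    # f_downsample: 2x2 max pooling halves each spatial dimension,
    # keeping the strongest response of every local region
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def build_pyramid(d0, levels=4):
    # D_{n+1} = f_downsample(f_Conv(D_n)); f_Conv is omitted here
    maps, x = [], d0
    for _ in range(levels):
        x = max_pool_2x2(x)
        maps.append(x)
    return maps  # D_1 .. D_levels, each half the size of the previous
```

For instance, an 8×8 input yields 4×4, 2×2 and 1×1 maps after three levels.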
Step S2, upsampling the minimum-scale feature map of the multi-scale feature map multiple times to obtain a multi-scale upsampled feature map, fusing the multi-scale feature map with the upsampled feature map of the corresponding scale to obtain a multi-scale fused feature map, and sequentially encoding the multi-scale fused feature map to generate a multi-scale embedded tensor.
As an example, the performing up-sampling on the minimum-scale feature map of the multi-scale feature map multiple times to obtain a multi-scale up-sampling feature map, and fusing the multi-scale feature map with the up-sampling feature map of the corresponding scale to obtain a multi-scale fused feature map, includes:

Step 21, continuously upsampling the minimum-scale feature map multiple times to obtain multi-scale upsampled feature maps equal in number to the multi-scale feature maps:

U_{n+1} = f_upsample(U_n), n = 0, 1, 2, 3

where f_upsample() denotes the upsampling operation, U_0 is the minimum-scale feature map D_4, and U_1 ~ U_4 are the upsampled feature maps of progressively increasing size obtained from four consecutive upsampling steps;

Step 22, superposing each multi-scale feature map with the upsampled feature map of the corresponding scale, and performing convolution smoothing on the superposed maps to obtain the multi-scale fused feature maps:

C_n = f_Conv3×3(U_n + D_{5-n}), n = 1, 2, 3, 4

where f_Conv3×3() denotes a 3×3 convolution and C_1 ~ C_4 are the fused feature maps obtained by superposing and smoothing the multi-scale feature maps with the upsampled feature maps of the corresponding scale.
In the embodiment of the disclosure, the multi-scale feature maps and the corresponding upsampled feature maps are superposed through upsampling operations with lateral connections. To avoid the insufficient fusion that direct element-wise addition of the two feature maps may cause, the superposed feature map is smoothed with a 3×3 convolution, yielding a more thoroughly fused feature map.
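The lateral-connection fusion of steps 21 and 22 can be sketched as follows in NumPy. This is a hedged illustration: upsampling is nearest-neighbor, the 3×3 smoothing convolution is omitted, and each upsampled map is paired with the equally sized feature map (an assumption, since the literal index D_{5-n} in the formula does not match the spatial sizes). `upsample_2x` and `fuse` are hypothetical names.

```python
import numpy as np

def upsample_2x(x):
    # f_upsample: nearest-neighbor upsampling, each pixel -> a 2x2 block
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fuse(d_maps):
    # d_maps: [D_1, D_2, D_3, D_4] with decreasing size; U_0 = D_4,
    # U_{n+1} = upsample(U_n); the 3x3 smoothing conv is omitted
    u, ups = d_maps[-1], []
    for _ in range(len(d_maps) - 1):
        u = upsample_2x(u)
        ups.append(u)
    # superpose each U_n with the same-sized feature map
    return [u_n + d for u_n, d in zip(ups, reversed(d_maps[:-1]))]
```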
As an example, the sequential encoding of the multi-scale fused feature maps to generate the multi-scale embedded tensors includes:

Step 23, performing self-attention calculation on each multi-scale fused feature map C_n (n = 1, 2, 3, 4) to obtain the corresponding initial embedded tensor Z_1 ~ Z_4, using the standard scaled dot-product attention:

Attention(Q_n, K_n, V_n) = softmax(Q_n K_n^T / √d_k) V_n

where Attention() is the attention function and d_k is the dimension of the query Q_n, key K_n and value V_n, which are all tensors of the n-th fused feature map C_n. The attention function maps the query Q and a set of key-value (K-V) pairs to an output, yielding a score for each pixel position of the fused feature map C_n, i.e. the corresponding initial embedded tensor Z_n, so as to capture long-range dependencies between different positions of the image.
Step 24, performing two linear transformations on each multi-scale initial embedded tensor Z_n, with a nonlinear ReLU activation between them, to generate the final embedded tensor Z'_n:

FFN(Z_n) = max(0, Z_n W_1 + b_1) W_2 + b_2, n = 1, 2, 3, 4

where FFN() denotes the feed-forward computation (a ReLU activation between two linear transformations) and W_1, b_1, W_2, b_2 are learnable parameters.

In the embodiment of the disclosure, the initial embedded tensors Z_1 ~ Z_4 pass through two linear transformations with a nonlinear ReLU activation in between, which nonlinearly transforms and maps them into the final embedded tensors Z'_1 ~ Z'_4, increasing the expressive and generalization capability of the model and thereby improving performance.
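The two encoder building blocks, scaled dot-product self-attention and the FFN, follow directly from the formulas above. This plain NumPy sketch handles a single head and no batching; it illustrates the standard operations, not the patent's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stabilized
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_k)) @ v

def ffn(z, w1, b1, w2, b2):
    # FFN(Z) = max(0, Z W_1 + b_1) W_2 + b_2:
    # two linear maps with a ReLU activation in between
    return np.maximum(0.0, z @ w1 + b1) @ w2 + b2
```

Because each softmax row sums to 1, attending over a constant value matrix returns that constant, which makes for a quick sanity check.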
Step S3, decoding the multi-scale embedded tensor according to the learnable query to obtain a multi-scale mask tensor and a multi-scale key corner tensor.
Step S3 of an embodiment of the present disclosure is to interact and integrate features of different scales to capture global context information.
As an example, the decoding of the multi-scale embedded tensors according to the learnable query to obtain the multi-scale mask tensor and multi-scale key corner tensor includes:

Step S31, performing self-attention calculation and nonlinear transformation on each multi-scale embedded tensor Z'_n to obtain the corresponding first output Z_sn:

Z_sn = FFN(Attention(Z'_n, Z'_n, Z'_n)), n = 1, 2, 3, 4

In step S31 of the presently disclosed embodiments, the query Q'_n, key K'_n and value V'_n of the self-attention computation are all the embedded tensor Z'_n, correspondingly generating Z_s1 ~ Z_s4.

Step S32, performing cross-attention calculation and nonlinear transformation on each first output Z_sn to obtain the corresponding second output Z_c1 ~ Z_c4:

Z_cn = FFN(Attention(Q_sn, Z_sn, Z_sn)), n = 1, 2, 3, 4

where the query Q_sn is a learnable parameter of shape [100, b, 256], and the key K_sn and value V_sn are the self-attention output Z_sn of the corresponding multi-scale embedded tensor Z'_n.

In the embodiment of the disclosure, the query Q_sn has shape [100, b, 256], where b is the number of input images per batch; each 256-dimensional vector represents detected box information, consisting of class information for distinguishing categories and spatial information (box coordinates) describing the position of the object in the image.

Step S33, performing a dot product operation between each second output Z_c1 ~ Z_c4 and the largest-scale fused feature map C_4 to obtain the multi-scale mask tensor V_mn and the multi-scale key corner tensor V_pn:

V_mn, V_pn = torch.mul(Z_cn, C_4), n = 1, 2, 3, 4

where torch.mul() denotes the dot product (element-wise multiplication) operation.
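The wiring of step S32, a learnable query bank attending over the self-attention output Z_sn, might look as follows. The batch dimension b of the [100, b, 256] query shape is dropped for clarity, the random initialization stands in for learned parameters, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(q, k, v):
    # query comes from one source; key and value from another
    scores = q @ k.T / np.sqrt(k.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ v

n_queries, d_model, n_tokens = 100, 256, 50
q_s = 0.02 * rng.standard_normal((n_queries, d_model))  # learnable query Q_sn
z_s = rng.standard_normal((n_tokens, d_model))          # first output Z_sn
z_c = cross_attention(q_s, z_s, z_s)                    # second output Z_cn
```

Each of the 100 query vectors produces one 256-dimensional output row regardless of how many tokens Z_sn contains.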
Step S4, re-decoding the multi-scale embedded tensor with the multi-scale mask tensor as the query to obtain a multi-scale contour tensor.
As an example, the re-decoding of the multi-scale embedded tensors with the multi-scale mask tensors as queries to obtain the multi-scale contour tensors includes:

taking each multi-scale mask tensor V_mn as the query, performing cross-attention calculation on the multi-scale embedded tensors Z'_1 ~ Z'_4, performing nonlinear transformation on the cross-attention output, and performing a dot product operation between the nonlinear transformation result and the largest-scale fused feature map C_4 to obtain the multi-scale contour tensor V_rn:

V_rn = torch.mul(FFN(Attention(V_mn, Z'_n, Z'_n)), C_4), n = 1, 2, 3, 4

where the key and value of the cross-attention computation are the corresponding multi-scale embedded tensor.
In step S4 of the embodiment of the disclosure, the multi-scale mask is used as the query quantity, so that the quality of the query quantity is improved, the perception of global features is increased, and the decoding capability is improved.
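Conditioning contour decoding on the decoded masks (step S4) reduces to a cross-attention whose query is V_mn and whose key and value are the embedded tensor Z'_n. The FFN stage and the dot product with C_4 are omitted in this hedged sketch, and `contour_decode` is a hypothetical name.

```python
import numpy as np

def contour_decode(v_m, z_emb):
    # mask tensor V_mn as the query; embedded tensor Z'_n as key and value
    scores = v_m @ z_emb.T / np.sqrt(z_emb.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ z_emb
```

With an all-zero query the attention weights are uniform, so the output is the mean of the value rows, which is handy for a sanity check.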
Step S5, splicing the multi-scale mask tensor, the multi-scale key corner tensor and the multi-scale contour tensor into a multi-scale fusion query, and encoding the multi-scale fusion query to obtain a final image segmentation result.
As an example, the stitching the multi-scale mask tensor, the multi-scale key corner tensor, and the multi-scale contour tensor into a multi-scale fusion query volume, and encoding the multi-scale fusion query volume to obtain a final image segmentation result, includes:
splicing the multi-scale mask tensor V_mn, multi-scale key corner tensor V_pn and multi-scale contour tensor V_rn to obtain the multi-scale fusion query B_n:

B_n = Concat(V_mn, V_pn, V_rn), n = 1, 2, 3, 4

where Concat() denotes tensor concatenation.
Performing self-attention calculation and nonlinear transformation on each multi-scale fusion query B_n establishes association and interaction between global and local features; a dot product operation with the largest-scale fused feature map then yields the segmentation results M_n at the different scales, which are accumulated into the final image instance segmentation result:

M_n = torch.mul(FFN(Attention(B_n, Z'_n, Z'_n)), C_4), n = 1, 2, 3, 4

result = torch.add(M_n), n = 1, 2, 3, 4

where torch.add() denotes the accumulation over the four scales.
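The concatenation and accumulation of step S5 can be sketched as follows; the attention/FFN stages and the dot product with C_4 are stubbed out, and both helper names are assumptions, not the patent's API.

```python
import numpy as np

def fuse_queries(v_m, v_p, v_r):
    # B_n = Concat(V_mn, V_pn, V_rn): splice along the feature axis
    return np.concatenate([v_m, v_p, v_r], axis=-1)

def accumulate(results):
    # result = sum over the scales of M_n (the torch.add accumulation)
    out = results[0].copy()
    for m in results[1:]:
        out = out + m
    return out
```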
The embodiment of the disclosure relates to an image instance segmentation method based on multi-scale feature fusion decoding: a multi-scale feature map is obtained from the input image; the multi-scale feature maps are superposed across layers and encoded to generate multi-scale embedded tensors; the multi-scale embedded tensors are decoded through learnable queries to obtain multi-scale mask tensors and multi-scale key corner tensors; the multi-scale embedded tensors are re-decoded with the multi-scale mask tensors as queries to obtain multi-scale contour tensors; and the multi-scale mask, key corner and contour tensors are spliced into a fusion tensor, which is encoded to obtain the final image instance segmentation result. By analyzing global features (masks and contours) together with local features (local key corner points) and performing multi-scale feature fusion decoding, the method removes falsely segmented regions, fills in missing mask parts, reduces the jagged effect at segmentation boundaries, and improves image instance segmentation accuracy.
As shown in fig. 2 and 3, another aspect of the embodiments of the present disclosure provides an image segmentation apparatus based on multi-scale feature fusion decoding, the apparatus including: feature extraction network 100, encoder 200, multi-scale feature decoder 300, contour decoder 400, and fusion decoder 500.
The feature extraction network 100 is configured to obtain a multi-scale feature map of an image to be segmented. The feature extraction network may include at least one convolution layer and one pooling layer; the convolution layer receives the original image to be segmented and performs convolution calculation, and the pooling layer then downsamples the result by max pooling to obtain the multi-scale feature map. The specific implementation is:
D_{n+1} = f_downsample(f_Conv(D_n)), n = 0, 1, 2, 3
where f_downsample() denotes the downsampling operation, f_Conv() the convolution operation, D_0 the original image, and D_1–D_4 the feature maps of progressively decreasing size obtained from four consecutive downsampling steps.
Downsampling to realize dimension reduction processing of the feature map; the pooling layer collects the information of the local areas in the input feature map by sampling the local areas, so that the number of parameters is reduced, the complexity of a model is reduced, the calculation cost of training and reasoning is reduced, the risk of overfitting is reduced, key information in the input feature map is reserved, and the segmentation device can better identify and learn important features.
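As a rough illustration of the convolution-plus-max-pooling pyramid described above, here is a minimal NumPy sketch. The 16×16 input, the 2×2 pooling window, and the omission of the learned convolution f_Conv (treated as identity) are all illustrative assumptions, not details from the patent:

```python
import numpy as np

def max_pool_2x2(x):
    """Downsample a (H, W) map by taking the max over non-overlapping 2x2 blocks."""
    h, w = x.shape
    x = x[: h - h % 2, : w - w % 2]               # trim odd rows/columns
    h2, w2 = x.shape
    return x.reshape(h2 // 2, 2, w2 // 2, 2).max(axis=(1, 3))

def build_pyramid(d0, levels=4):
    """D_1..D_4: successive downsamplings of the original image D_0.
    The learned convolution f_Conv is elided (identity) for brevity."""
    maps = [d0]
    for _ in range(levels):
        maps.append(max_pool_2x2(maps[-1]))
    return maps

pyramid = build_pyramid(np.arange(256, dtype=float).reshape(16, 16))
print([m.shape for m in pyramid])   # [(16, 16), (8, 8), (4, 4), (2, 2), (1, 1)]
```

Each level halves both spatial dimensions, matching the "progressively decreasing size" of D_1 through D_4 above.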
The feature extraction network in the embodiment of the disclosure may be a residual network or a Transformer feature extraction network.
The feature extraction network is used as a backbone network for acquiring a multi-scale feature map of the image, and is used for processing the multi-scale problem in image instance segmentation.
The encoder 200 is configured to upsample the minimum-scale feature map of the multi-scale feature map multiple times to obtain a multi-scale upsampled feature map, fuse the multi-scale feature map with the upsampled feature map of the corresponding scale to obtain a multi-scale fused feature map, and sequentially encode the multi-scale fused feature map to generate a multi-scale embedded tensor.
The encoder may be a multi-scale deformable self-attention encoder comprising at least one basic Transformer layer, one upsampling layer, and one superposition layer, and employs a self-attention mechanism. The upsampling layer upsamples the small-scale feature maps of the image, and may use one or more upsampling methods such as nearest-neighbor interpolation, bilinear interpolation, or transposed convolution. The upsampling layer performs several consecutive upsamplings of the minimum-scale feature map to obtain as many upsampled feature maps as there are multi-scale feature maps, as follows:
U_{n+1} = f_upsample(U_n), n = 0, 1, 2, 3
where f_upsample() denotes the upsampling operation, U_0 is the smallest-scale feature map D_4, and U_1–U_4 are the upsampled feature maps of progressively increasing size obtained from four consecutive upsampling steps.
The superposition layer of the encoder superposes and smooths each multi-scale feature map with the upsampled feature map of the corresponding scale to obtain the multi-scale fused feature maps of the image, as follows:
C_n = f_Conv3×3(U_n + D_{5-n}), n = 1, 2, 3, 4
where f_Conv3×3 denotes a 3×3 convolution, and C_1–C_4 are the fused feature maps obtained by superposing and smoothing each multi-scale feature map with its corresponding upsampled feature map.
Each superposition layer is laterally connected to an upsampling layer; the upsampling layers deepen the network, enlarge the receptive field, increase the expressive capacity of the model, and enlarge the feature maps. Meanwhile, to eliminate the insufficient fusion that direct element-wise addition of two feature maps may cause, a 3×3 convolution smooths the fused feature map, yielding a more thoroughly fused feature map.
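The superpose-then-smooth step C_n = f_Conv3×3(U_n + D_{5-n}) can be sketched as element-wise addition followed by a 3×3 "same" convolution. The uniform box kernel below stands in for the learned convolution weights and is purely illustrative:

```python
import numpy as np

def conv3x3_same(x, k):
    """3x3 'same' convolution with zero padding (smoothing a single channel)."""
    p = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (p[i:i + 3, j:j + 3] * k).sum()
    return out

# Fuse the upsampled map with the same-scale feature map, then smooth:
# C_n = f_Conv3x3(U_n + D_{5-n})
u = np.ones((4, 4))                  # stand-in for U_n
d = np.full((4, 4), 2.0)             # stand-in for D_{5-n}
box = np.full((3, 3), 1.0 / 9.0)     # illustrative smoothing kernel; learned in practice
c = conv3x3_same(u + d, box)
print(round(c[1, 1], 6))             # 3.0 (interior value: mean of nine 3.0s)
```

Note how the smoothing blends each fused value with its neighbours, which is exactly what mitigates the rough seams that raw element-wise addition can leave.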
The basic Transformer layer of the encoder comprises at least one self-attention module and one feedforward neural network, with at least one attention head. The self-attention module captures long-range dependencies between different positions of the image: it performs self-attention calculation on each multi-scale fused feature map C_n to obtain the corresponding initial embedded tensor Z_n. The self-attention calculation formula is:
Attention(Q_n, K_n, V_n) = softmax(Q_n K_n^T / √d_k) V_n
where Attention() is the attention function; the query Q_n, key K_n, and value V_n are all tensors of the n-th fused feature map C_n, and d_k is the dimension of Q_n, K_n, and V_n.
The feedforward neural network of the encoder applies two linear transformations to each multi-scale initial embedded tensor Z_n, with a nonlinear ReLU activation between them, to generate the final multi-scale embedded tensor Z'_n:
FFN(Z_n) = max(0, Z_n W_1 + b_1) W_2 + b_2
where FFN() denotes the feedforward network (two linear transformations with a ReLU activation between them), and W_1, b_1, W_2, b_2 are learnable parameters.
The feedforward neural network performs nonlinear transformation and mapping on the initial embedded tensors Z_1–Z_4 to generate the final output, increasing the model's expressive and generalization capacity and thereby its performance; the encoder thus encodes the multi-scale fused feature maps with the self-attention module and the feedforward neural network to generate the multi-scale embedded tensors.
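The encoder's two building blocks — scaled dot-product self-attention and the FFN with a ReLU between two linear maps — can be sketched in NumPy as follows; the token count, embedding width, and random weights are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_k)) @ v

def ffn(z, w1, b1, w2, b2):
    """FFN(Z) = max(0, Z W1 + b1) W2 + b2: two linear maps with a ReLU between."""
    return np.maximum(0.0, z @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
c_n = rng.standard_normal((5, 8))      # fused feature map C_n as 5 tokens of width 8
z_n = attention(c_n, c_n, c_n)         # self-attention: Q = K = V = C_n
w1, b1 = rng.standard_normal((8, 16)), np.zeros(16)
w2, b2 = rng.standard_normal((16, 8)), np.zeros(8)
z_prime = ffn(z_n, w1, b1, w2, b2)     # final embedded tensor Z'_n
print(z_prime.shape)                   # (5, 8)
```

With a single identical token, attention degenerates to returning the value unchanged, which is a quick sanity check of the softmax weighting.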
The encoder of the embodiment of the disclosure carries out up-sampling on the small-scale feature map through the up-sampling layer, then stacks the up-sampled feature map and the multi-scale feature map by utilizing the stacking layer to realize fusion of the multi-scale feature map, and then calculates a multi-scale embedded tensor by utilizing the self-attention module aiming at the fused multi-scale feature map so as to capture long-distance dependency relations among different positions, so that the method can better process object examples with different scales and shapes in the image.
The multi-scale feature decoder 300 is configured to decode the multi-scale embedded tensor according to a learnable query volume to obtain a multi-scale mask tensor and a multi-scale key corner tensor.
The multi-scale feature decoder may include at least one DetrTransformer decoding layer with at least one attention head; each DetrTransformer decoding layer includes at least one self-attention module, one cross-attention module, and one feedforward neural network. The self-attention module of the decoder decodes the multi-scale embedded tensor according to the learnable query: the self-attention module and the feedforward network perform self-attention calculation and nonlinear transformation on each multi-scale embedded tensor Z'_n to obtain the corresponding first output Z_sn. The specific formula is:
Z_sn = FFN(Attention(Z'_n, Z'_n, Z'_n)), n = 1, 2, 3, 4
In this self-attention calculation, the query Q'_n, key K'_n, and value V'_n are all the embedded tensor Z'_n, correspondingly generating Z_sn.
The cross-attention module and feedforward network of the multi-scale feature decoder perform cross-attention calculation and nonlinear transformation on the first output Z_sn to obtain the corresponding second output Z_cn. The specific formula is:
Z_cn = FFN(Attention(Q_sn, Z_sn, Z_sn)), n = 1, 2, 3, 4
where the query Q_sn is a learnable query of shape [100, b, 256], and the key K_sn and value V_sn are the first output Z_sn obtained from the self-attention calculation on the multi-scale embedded tensor Z'_n.
The cross-attention module of the multi-scale feature decoder performs a dot product between the second output Z_cn and the largest-scale fused feature map C_4 to obtain the multi-scale mask tensor V_mn and the multi-scale key corner tensor V_pn, specifically:
V_mn, V_pn = torch.mul(Z_cn, C_4), n = 1, 2, 3, 4
where torch.mul() denotes the dot product operation.
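The dot product between the decoded query embeddings Z_cn and the largest fused feature map C_4 can be sketched as an einsum producing one score map per query. The shapes (100 queries, 256 channels, a 32×32 spatial grid) echo the [100, b, 256] query shape mentioned above but are otherwise assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
z_cn = rng.standard_normal((100, 256))       # 100 decoded query embeddings
c4 = rng.standard_normal((256, 32, 32))      # largest fused feature map C_4
# Dot each query with every spatial position of C_4: one score map per query.
masks = np.einsum('qc,chw->qhw', z_cn, c4)
print(masks.shape)                           # (100, 32, 32)
```

Thresholding each score map would yield a binary mask per query; the patent leaves that post-processing implicit.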
The multi-scale feature decoder of the disclosed embodiments extracts local features by constraining the cross-attention (cross-attention module) of each query to the foreground region of its predicted mask, yielding the multi-scale mask tensors and multi-scale key corner tensors.
The multi-scale feature decoder of the embodiment of the disclosure decodes the multi-scale embedded tensor according to the learnable query to obtain a multi-scale mask tensor and a multi-scale key corner tensor; its self-attention and cross-attention modules interact and integrate features of different scales and capture global context information. The query is a learnable embedded tensor that injects target-category information into the self-attention module, giving the method faster convergence and better performance.
The contour decoder 400 is configured to re-decode the multi-scale embedded tensor by using the multi-scale mask tensor as a query quantity to obtain a multi-scale contour tensor.
The contour decoder may include at least one DetrTransformer decoding layer with at least one attention head, each comprising at least one self-attention module, one cross-attention module, and one feedforward neural network. The contour decoder re-decodes the multi-scale embedded tensor with the multi-scale mask tensor as the query: its cross-attention module takes each multi-scale mask V_mn as the query, performs cross-attention calculation on the corresponding embedded tensors Z'_1–Z'_4, feeds the cross-attention output to the feedforward network for nonlinear transformation, and takes a dot product of the nonlinear transformation result with the largest-scale fused feature map C_4 to obtain the multi-scale contour tensor V_rn, specifically:
V_rn = torch.mul(FFN(Attention(V_mn, Z'_n, Z'_n)), C_4), n = 1, 2, 3, 4
where the key and value of the cross-attention calculation are the corresponding multi-scale embedded tensor.
The contour decoder takes the multi-scale mask tensor as the query quantity, improves the quality of the query quantity, increases the perception of the model on the global characteristics, and improves the decoding capability.
The contour decoder of the embodiment of the disclosure uses the multi-scale mask tensor as a query quantity to decode the multi-scale embedded tensor again to obtain the multi-scale contour tensor; the cross attention module takes a multi-scale mask as the query quantity, improves the quality of the query quantity, increases the perception of the model on the global features, and improves the decoding capability. The feed-forward neural network is used to perform nonlinear transformation and mapping on the cross-attention computation results to generate final outputs that help model learn task-specific representations, thereby improving performance.
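The contour decoder's key step — cross-attention with the mask tensor as query and the embedded tensor as key/value — can be sketched as follows; all shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, with query and key/value from different sources."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(3)
v_mn = rng.standard_normal((100, 256))   # multi-scale mask tensor, used as the query
z_n = rng.standard_normal((400, 256))    # embedded tensor Z'_n as 400 tokens: key and value
contour = cross_attention(v_mn, z_n, z_n)
print(contour.shape)                     # (100, 256)
```

Because each mask query attends over all embedded tokens with weights that sum to one, the output stays in the same embedding space as the queries, which is what lets it be re-used against C_4 downstream.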
The fusion decoder 500 is configured to concatenate the multi-scale mask tensor, the multi-scale key corner tensor, and the multi-scale contour tensor into a fusion query, and to encode the fusion query with an encoder network structure to obtain the final image segmentation result.
The fusion decoder may be a self-attention encoder comprising at least one basic Transformer layer with at least one attention head, each layer including at least one self-attention module and one feedforward neural network, and employs a self-attention encoding mechanism. The self-attention module of the fusion decoder concatenates the multi-scale mask tensor V_mn, key corner tensor V_pn, and contour tensor V_rn into the multi-scale fusion query B_n. The concatenation formula is:
B_n = Concat(V_mn, V_pn, V_rn), n = 1, 2, 3, 4
where Concat denotes tensor concatenation.
The self-attention module of the fusion decoder is used for establishing a query-key-value relation based on the fusion query quantity, capturing multi-scale and multi-type information in the image and obtaining a final image instance segmentation result.
The self-attention module and feedforward network of the fusion decoder perform self-attention calculation and nonlinear transformation on the multi-scale fusion query B_n, establishing association and interaction between global and local features, and take a dot product with the largest-scale fused feature map to obtain the segmentation results M_n at each scale; the M_n are accumulated to obtain the final image instance segmentation result, specifically:
M_n = torch.mul(FFN(Attention(B_n, Z'_n, Z'_n)), C_4), n = 1, 2, 3, 4
Result = torch.add(M_n), n = 1, 2, 3, 4
where torch.add() denotes the accumulation calculation.
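The fusion decoder's concatenate-then-accumulate pipeline can be sketched as below. The concatenation axis (stacking the three query sets) and all shapes are assumptions, and the attention/dot-product step is replaced by random stand-in maps M_n:

```python
import numpy as np

rng = np.random.default_rng(2)
# Per-scale mask, key-corner, and contour tensors (100 queries, width 256 each).
v_m = [rng.standard_normal((100, 256)) for _ in range(4)]
v_p = [rng.standard_normal((100, 256)) for _ in range(4)]
v_r = [rng.standard_normal((100, 256)) for _ in range(4)]

# B_n = Concat(V_mn, V_pn, V_rn): here the three query sets are stacked.
b = [np.concatenate([v_m[n], v_p[n], v_r[n]], axis=0) for n in range(4)]
print(b[0].shape)                      # (300, 256)

# Stand-ins for the per-scale maps M_n (after attention + dot product with C_4),
# accumulated into the final result: Result = M_1 + M_2 + M_3 + M_4.
m = [rng.standard_normal((32, 32)) for _ in range(4)]
result = m[0] + m[1] + m[2] + m[3]
print(result.shape)                    # (32, 32)
```

Summing the per-scale maps is what lets a coarse scale fill a region a fine scale missed, which is the stated motivation for the fusion step.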
The fusion decoder of the embodiment of the disclosure splices the multi-scale mask tensor, the key corner tensor and the outline tensor into the fusion query volume, and the self-attention module is used for establishing association and interaction between the global features and the local features so as to help the model to better understand semantic association and correlation of input data. And the fusion decoder encodes the fusion tensor by utilizing the encoder network structure to obtain a final image instance segmentation result.
The embodiment of the disclosure relates to an image instance segmentation device based on multi-scale feature fusion decoding, comprising a feature extraction network, an encoder, a multi-scale feature decoder, a contour decoder, and a fusion decoder. The feature extraction network acquires a multi-scale feature map from the input image; the encoder cross-layer superposes and encodes the multi-scale feature map to generate a multi-scale embedded tensor; the multi-scale feature decoder decodes the multi-scale embedded tensor with a learnable query to obtain a multi-scale mask tensor and a multi-scale key corner tensor; the contour decoder re-decodes the multi-scale embedded tensor with the multi-scale mask tensor as the query to obtain a multi-scale contour tensor; and the fusion decoder concatenates the multi-scale mask, key corner, and contour tensors into a fusion query and encodes it with an encoder network structure to obtain the final image instance segmentation result.
Another aspect of the disclosed embodiments provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the image segmentation method as described above when executing the program.
Another aspect of the disclosed embodiments provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the image segmentation method as described above.
A computer-readable storage medium may be any tangible medium that can contain or store a program; it may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device. More specific examples include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, an optical fiber, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. The computer-readable storage medium may also include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code; specific examples include, but are not limited to, electromagnetic signals, optical signals, or any suitable combination thereof.
The foregoing is merely a preferred implementation of the embodiments of the disclosure, and it should be noted that, for a person skilled in the art, several improvements and modifications may be made without departing from the principles of the embodiments of the disclosure, which should also be considered as protective scope of the embodiments of the disclosure.
Claims (6)
1. An image segmentation method based on multi-scale feature fusion decoding, the method comprising:
acquiring a multi-scale feature map of an image to be segmented;
performing up-sampling on the minimum-scale feature map in the multi-scale feature map for multiple times to obtain a multi-scale up-sampling feature map, fusing the multi-scale feature map with the up-sampling feature map of the corresponding scale to obtain a multi-scale fused feature map, and sequentially encoding the multi-scale fused feature map to generate a multi-scale embedded tensor;
the encoding of the multi-scale fusion feature map in turn generates a multi-scale embedded tensor, comprising: respectively carrying out self-attention calculation on the multi-scale fusion feature images to obtain corresponding initial embedded tensors;
respectively carrying out two linear transformations on the multi-scale initial embedding tensor, and carrying out nonlinear ReLU activation in the middle of the two linear transformations to generate a final multi-scale embedding tensor;
decoding the multi-scale embedded tensor according to the learnable query quantity to obtain a multi-scale mask tensor and a multi-scale key corner tensor, wherein the method specifically comprises the following steps of: performing self-attention computation and nonlinear transformation on the multi-scale embedded tensors respectively to obtain corresponding first output, wherein the query quantity, key and value of the self-attention computation are all corresponding multi-scale embedded tensors;
performing cross attention calculation and nonlinear transformation on the first output to obtain a corresponding second output; the query quantity in the cross attention calculation is a parameter quantity which can be learned, and the key and the value are the first output for carrying out self attention calculation corresponding to the multi-scale embedded tensor;
respectively carrying out dot product operation on the second output and the fusion feature map with the largest scale to obtain a multi-scale mask tensor and a multi-scale key corner tensor;
re-decoding the multi-scale embedded tensor by taking the multi-scale mask tensor as a query quantity to obtain a multi-scale contour tensor, wherein the method specifically comprises the following steps of: respectively taking the multi-scale mask tensor as query quantity, wherein keys and values correspond to the multi-scale embedded tensor, respectively carrying out cross attention calculation on the multi-scale embedded tensor, carrying out nonlinear transformation on cross attention calculation output, and carrying out dot product operation on a nonlinear transformation result and a fused feature map with the largest scale to obtain a multi-scale contour tensor;
splicing the multi-scale mask tensor, the multi-scale key corner tensor and the multi-scale contour tensor into a multi-scale fusion query volume, and encoding the multi-scale fusion query volume to obtain a final image segmentation result, wherein the method specifically comprises the following steps of:
respectively splicing the multi-scale mask tensor, the multi-scale key corner tensor and the multi-scale outline tensor to obtain a multi-scale fusion query quantity;
and performing self-attention calculation and nonlinear transformation on the multi-scale fusion query volume, performing dot product operation with the fusion feature map with the largest scale to obtain segmentation results with different scales, and accumulating the segmentation results with different scales to obtain a final image instance segmentation result.
2. The method of claim 1, wherein the acquiring the multi-scale feature map of the image to be segmented comprises:
acquiring an original image to be segmented;
and carrying out convolution calculation and downsampling on the original image in sequence by adopting a maximum pooling method to obtain a multi-scale feature map.
3. The method according to claim 1 or 2, wherein the upsampling the minimum-scale feature map of the multi-scale feature map multiple times to obtain a multi-scale upsampled feature map, and fusing the multi-scale feature map with the upsampled feature map of the corresponding scale to obtain a multi-scale fused feature map, includes:
continuously upsampling the minimum scale feature map of the multi-scale feature map for multiple times to obtain multi-scale upsampled feature maps, the number of which is the same as that of the multi-scale feature maps;
and respectively superposing the multi-scale feature map and the up-sampling feature map with corresponding scales, and performing convolution smoothing on the superposed multi-scale feature map to obtain a multi-scale fusion feature map.
4. An image segmentation apparatus based on multi-scale feature fusion decoding, the apparatus comprising:
the feature extraction network is used for acquiring images and extracting multi-scale feature images;
the coder is used for carrying out up-sampling on the multi-scale feature map for a plurality of times to obtain a corresponding up-sampling feature map, fusing the multi-scale feature map with the up-sampling feature map with a corresponding scale, and sequentially coding the fused multi-scale feature map to generate a multi-scale embedded tensor; the encoding of the multi-scale fusion feature map in turn generates a multi-scale embedded tensor, comprising: respectively carrying out self-attention calculation on the multi-scale fusion feature images to obtain corresponding initial embedded tensors;
respectively carrying out two linear transformations on the multi-scale initial embedding tensor, and carrying out nonlinear ReLU activation in the middle of the two linear transformations to generate a final multi-scale embedding tensor;
the multi-scale feature decoder is configured to decode the multi-scale embedded tensor according to a learnable query quantity to obtain a multi-scale mask tensor and a multi-scale key corner tensor, and specifically includes: performing self-attention computation and nonlinear transformation on the multi-scale embedded tensors respectively to obtain corresponding first output, wherein the query quantity, key and value of the self-attention computation are all corresponding multi-scale embedded tensors;
performing cross attention calculation and nonlinear transformation on the first output to obtain a corresponding second output; the query quantity in the cross attention calculation is a parameter quantity which can be learned, and the key and the value are the first output for carrying out self attention calculation corresponding to the multi-scale embedded tensor;
respectively carrying out dot product operation on the second output and the fusion feature map with the largest scale to obtain a multi-scale mask tensor and a multi-scale key corner tensor;
the contour decoder is configured to re-decode the multi-scale embedded tensor by using the multi-scale mask tensor as a query quantity to obtain a multi-scale contour tensor, and specifically includes: respectively taking the multi-scale mask tensor as query quantity, wherein keys and values correspond to the multi-scale embedded tensor, respectively carrying out cross attention calculation on the multi-scale embedded tensor, carrying out nonlinear transformation on cross attention calculation output, and carrying out dot product operation on a nonlinear transformation result and a fused feature map with the largest scale to obtain a multi-scale contour tensor;
the fusion decoder is used for splicing the multi-scale mask tensor, the multi-scale key corner tensor and the multi-scale outline tensor into multi-scale fusion query quantity, and encoding the multi-scale fusion tensor to obtain a final image segmentation result, and specifically comprises the following steps:
respectively splicing the multi-scale mask tensor, the multi-scale key corner tensor and the multi-scale outline tensor to obtain a multi-scale fusion query quantity;
and performing self-attention calculation and nonlinear transformation on the multi-scale fusion query volume, performing dot product operation with the fusion feature map with the largest scale to obtain segmentation results with different scales, and accumulating the segmentation results with different scales to obtain a final image instance segmentation result.
5. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 3 when the program is executed by the processor.
6. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements a method according to any one of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311529949.6A CN117314938B (en) | 2023-11-16 | 2023-11-16 | Image segmentation method and device based on multi-scale feature fusion decoding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117314938A CN117314938A (en) | 2023-12-29 |
CN117314938B true CN117314938B (en) | 2024-04-05 |
Family
ID=89237565
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311529949.6A Active CN117314938B (en) | 2023-11-16 | 2023-11-16 | Image segmentation method and device based on multi-scale feature fusion decoding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117314938B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020216227A1 (en) * | 2019-04-24 | 2020-10-29 | 华为技术有限公司 | Image classification method and apparatus, and data processing method and apparatus |
CN115761222A (en) * | 2022-09-27 | 2023-03-07 | 阿里巴巴(中国)有限公司 | Image segmentation method, remote sensing image segmentation method and device |
CN116091942A (en) * | 2023-02-16 | 2023-05-09 | 中国科学院半导体研究所 | Feature enhancement and fusion small target detection method, device and equipment |
CN116597263A (en) * | 2023-05-12 | 2023-08-15 | 深圳亿嘉和科技研发有限公司 | Training method and related device for image synthesis model |
Non-Patent Citations (2)
Title |
---|
Double-branch U-Net for multi-scale organ segmentation; Liu Yuhao et al.; Methods; 2022; pp. 1-8 *
Prostate image segmentation based on dense connections and Inception modules; Xu Yaoyao et al.; Electronic Measurement Technology (电子测量技术); 2022; pp. 1-9 *
Also Published As
Publication number | Publication date |
---|---|
CN117314938A (en) | 2023-12-29 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |