CN117372444A - Interactive image segmentation method and system based on lightweight adapter fine-tuning - Google Patents

Interactive image segmentation method and system based on lightweight adapter fine-tuning

Info

Publication number
CN117372444A
Authority
CN
China
Prior art keywords
image
feature
module
inputting
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311389484.9A
Other languages
Chinese (zh)
Inventor
陈勇全
许龙
李尚鸿
徐旦
黄志文
黄锐
吴均峰
孙启霖
许振兴
李辉
赵妍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese University of Hong Kong Shenzhen
Shenzhen Institute of Artificial Intelligence and Robotics
Original Assignee
Chinese University of Hong Kong Shenzhen
Shenzhen Institute of Artificial Intelligence and Robotics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese University of Hong Kong Shenzhen and Shenzhen Institute of Artificial Intelligence and Robotics
Priority to CN202311389484.9A
Publication of CN117372444A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0499Feedforward networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/52Scale-space analysis, e.g. wavelet analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20092Interactive image processing based on input by user
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Editing Of Facsimile Originals (AREA)

Abstract

The invention discloses an interactive image segmentation method and system based on lightweight adapter fine-tuning, wherein the method comprises the following steps: acquiring an input image and a click image, and performing image segmentation on the input image to obtain a plurality of non-overlapping image blocks; inputting the click image and the image blocks into a Plain ViT backbone network for image feature fusion to obtain a multi-scale feature map of the input image; inputting the input image into a self-adaptive domain feature extractor for image feature extraction to obtain a high-quality feature map of the input image; acquiring a history prediction image and mask features, and downsampling the history prediction image to obtain a thumbnail of the history prediction image; and inputting the history prediction image thumbnail, the mask features, the multi-scale feature map and the high-quality feature map into a ViT backbone network adapter for image decoding to obtain a predicted image. By combining the self-adaptive domain feature extraction network with the backbone network to extract image features, a better image segmentation effect and higher efficiency are obtained.

Description

Interactive image segmentation method and system based on lightweight adapter fine-tuning
Technical Field
The invention relates to the technical field of computer vision, and in particular to an interactive image segmentation method and system based on lightweight adapter fine-tuning.
Background
The interactive segmentation task aims to segment target objects in an image using limited user interaction. Because additional user annotation information is incorporated, interactive segmentation can generally achieve higher accuracy than traditional segmentation methods in complex scenes. Existing image segmentation techniques mainly focus on improving algorithm performance on a specific data set and therefore overlook the data diversity that is common in real crowdsourced annotation tasks; such variable data causes the algorithm to suffer cross-domain performance loss and reduces the efficiency of the annotation algorithm.
Accordingly, there is a need in the art for improvement.
Disclosure of Invention
To address the above-mentioned defects in the prior art, the invention provides an interactive image segmentation method and system based on lightweight adapter fine-tuning, aiming to solve the problem of poor image segmentation performance in the prior art.
The technical scheme adopted by the invention for solving the problems is as follows:
in a first aspect, an embodiment of the present invention provides an interactive image segmentation method based on lightweight adapter trimming, where the method includes:
Acquiring an input image and a click image, and performing image segmentation on the input image to obtain a plurality of non-overlapping image blocks;
inputting the click image and the image block into a Plain ViT backbone network for image feature fusion to obtain a multi-scale feature map of the input image;
inputting the input image into a self-adaptive domain feature extractor for image feature extraction to obtain a high-quality feature map of the input image;
acquiring a history prediction image and mask features, and downsampling the history prediction image to obtain a thumbnail of the history prediction image;
and inputting the historical predicted image thumbnail, the mask feature, the multi-scale feature map and the high-quality feature map into a ViT backbone network adapter for image decoding to obtain a predicted image.
In one implementation, the Plain ViT backbone network includes a first Patch embedding module, a second Patch embedding module, N Transformer encoder modules, and a feature pyramid module, and the inputting the click image and the image block into the Plain ViT backbone network for image feature fusion to obtain a multi-scale feature map of the input image includes:
Inputting the click image into the first Patch embedding module for feature mapping to obtain a click image feature sequence;
inputting the image block into the second Patch embedding module for feature mapping to obtain an image block feature sequence;
sequentially inputting the click image feature sequence and the image block feature sequence into the N Transformer encoder modules to perform image encoding to obtain a global feature image of the input image;
and inputting the global feature image into the feature pyramid module to perform image feature fusion to obtain a multi-scale feature map.
In one implementation, the adaptive domain feature extractor includes a spatial prior module, an injection module, a precision feature extraction module, and a feature compression module, and the inputting the input image into the adaptive domain feature extractor for image feature extraction to obtain a high-quality feature map of the input image includes:
normalizing the input image;
inputting the standardized input image into the space prior module for extracting image details to obtain an output image of the space prior module;
inputting the output image of the space prior module and the global feature image into the injection module for image feature fusion to obtain a feature fusion image;
Inputting the feature fusion map into the precision feature extraction module to extract image precision features, so as to obtain a high-precision feature map;
and inputting the high-precision feature map into the feature compression module to compress the image features to obtain a high-quality feature map.
In one implementation, the ViT backbone network adapter includes a dense input embedding module, a multi-scale feature fusion module, and a bi-directional attention module, and the inputting the historical predicted image thumbnail, the mask feature, the multi-scale feature map, and the high-quality feature map into the ViT backbone network adapter decodes the image to obtain a predicted image, including:
inputting the mask feature into the bi-directional attention module using the dense input embedding module;
inputting the multi-scale feature map into the multi-scale feature fusion module for feature fusion, and inputting the fused image into the bidirectional attention module;
inputting the historical predicted image thumbnail and the high quality feature map into the bi-directional attention module;
and adopting the bidirectional attention module to carry out image decoding on the historical predicted image thumbnail, the mask feature, the multi-scale feature map and the high-quality feature map to obtain the predicted image.
In one implementation, after inputting the historical predicted image thumbnail, the mask feature, the multi-scale fusion feature map, and the high-quality feature map into the ViT backbone network adapter for image decoding, the method further comprises:
and carrying out logarithmic summation on the predicted image and the multi-scale feature map to obtain a corrected output image.
In one implementation, the spatial prior module includes a trunk layer, a convolution filter layer, and a full connection layer;
the trunk layer comprises three deformation convolution layers and a maximum pooling layer;
the trunk layer is used to capture spatial features of the input image, and the convolution filter layer is used to double the number of feature channels and reduce the size of the input image.
In one implementation, the dense input embedding module includes four two-dimensional convolution layers, a LayerNorm layer, and an activation function.
In a second aspect, an embodiment of the present invention further provides an interactive image segmentation system based on lightweight adapter trimming, where the system includes:
the first image acquisition module is used for acquiring an input image and a click image, and carrying out image segmentation on the input image to obtain a plurality of non-overlapping image blocks;
The feature fusion module is used for inputting the click image and the image block into a Plain ViT backbone network to perform image feature fusion, so as to obtain a multi-scale feature map of the input image;
the feature extraction module is used for inputting the input image into the self-adaptive domain feature extractor for image feature extraction to obtain a high-quality feature map of the input image;
the second image acquisition module is used for acquiring a historical predicted image and mask features, and downsampling the historical predicted image to obtain a thumbnail of the historical predicted image;
and the image decoding module is used for carrying out image decoding on the historical predicted image thumbnail, the mask feature, the multi-scale fusion feature map and the high-quality feature map by inputting the historical predicted image thumbnail, the mask feature, the multi-scale fusion feature map and the high-quality feature map into a ViT backbone network adapter to obtain a predicted image.
In one implementation, the feature fusion module includes:
the click image feature sequence acquisition unit is used for inputting the click image into the first Patch embedding module for feature mapping to obtain a click image feature sequence;
the image block feature sequence obtaining unit is used for inputting the image block into the second Patch embedding module for feature mapping to obtain an image block feature sequence;
The global feature image acquisition unit is used for sequentially inputting the click image feature sequence and the image block feature sequence into the N Transformer encoder modules to perform image encoding so as to obtain a global feature image of the input image;
the multi-scale feature map acquisition unit is used for inputting the global feature image into the feature pyramid module to perform image feature fusion, so as to obtain a multi-scale feature map.
In one implementation, the feature extraction module includes:
a normalization unit for normalizing the input image;
the image detail extraction unit is used for inputting the standardized input image into the space prior module to extract image details so as to obtain an output image of the space prior module;
the image feature fusion unit is used for inputting the output image of the space prior module and the global feature image into the injection module to perform image feature fusion to obtain a feature fusion image;
the image precision feature extraction unit is used for inputting the feature fusion map into the precision feature extraction module to extract image precision features so as to obtain a high-precision feature map;
and the image feature compression unit is used for inputting the high-precision feature map into the feature compression module to compress the image features so as to obtain a high-quality feature map.
In one implementation, the image decoding module includes:
a first input unit for inputting the mask feature into the bi-directional attention module using the dense input embedding module;
the second input unit is used for inputting the multi-scale feature map into the multi-scale feature fusion module to perform feature fusion, and inputting the fused image into the bidirectional attention module;
a third input unit for inputting the history prediction image thumbnail and the high-quality feature map into the bidirectional attention module;
and the decoding unit is used for carrying out image decoding on the historical predicted image thumbnail, the mask feature, the multi-scale feature map and the high-quality feature map by adopting the bidirectional attention module to obtain the predicted image.
In one implementation, the system further comprises:
and the image correction module is used for carrying out logarithmic summation on the predicted image and the multi-scale feature map to obtain a corrected output image.
In one implementation, the spatial prior module includes a trunk layer, a convolution filter layer, and a full connection layer; the trunk layer comprises three deformation convolution layers and a maximum pooling layer; the trunk layer is used to capture spatial features of the input image, and the convolution filter layer is used to double the number of feature channels and reduce the size of the input image.
In one implementation, the dense input embedding module includes four two-dimensional convolution layers, a LayerNorm layer, and an activation function.
In a third aspect, the present invention provides a terminal device including a memory, one or more processors, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the interactive image segmentation method based on lightweight adapter fine-tuning as described in any one of the above.
In a fourth aspect, embodiments of the present invention further provide a non-transitory computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the interactive image segmentation method based on lightweight adapter fine-tuning according to any one of the above.
The invention has the beneficial effects that: compared with the prior art, the invention provides an interactive image segmentation method based on lightweight adapter fine-tuning, which comprises the steps of firstly acquiring an input image and a click image, and carrying out image segmentation on the input image to obtain a plurality of non-overlapping image blocks; inputting the click image and the image blocks into a Plain ViT backbone network for image feature fusion to obtain a multi-scale feature map of the input image; further inputting the input image into a self-adaptive domain feature extractor for image feature extraction to obtain a high-quality feature map of the input image; acquiring a history prediction image and mask features, and downsampling the history prediction image to obtain a thumbnail of the history prediction image; and finally, inputting the thumbnail of the history prediction image, the mask features, the multi-scale feature map and the high-quality feature map into a ViT backbone network adapter for image decoding to obtain the predicted image. The self-adaptive domain feature extraction network is adopted in the application and combined with the ViT backbone network, so that the image features are better extracted, better cross-domain performance is realized, and a better image segmentation effect and higher segmentation efficiency can be obtained.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and other drawings may be obtained according to the drawings without inventive effort to those skilled in the art.
Fig. 1 is a flowchart of an interactive image segmentation method based on light-weight adapter trimming according to an embodiment of the present invention.
Fig. 2 is an overall schematic diagram of an interactive image segmentation method based on light-weight adapter trimming according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an adaptive domain feature extractor of an interactive image segmentation method based on light-weight adapter trimming according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a ViT backbone network adapter of an interactive image segmentation method based on light-weight adapter trimming according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of the target segmentation effect on fine image domains achieved by the interactive image segmentation method based on lightweight adapter fine-tuning according to an embodiment of the present invention.
Fig. 6 is a schematic block diagram of an interactive image segmentation system based on lightweight adapter trimming according to an embodiment of the present invention.
Fig. 7 is a schematic block diagram of an internal structure of a terminal device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clear and clear, the present invention will be further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It should be noted that, if directional indications (such as up, down, left, right, front, and rear … …) are included in the embodiments of the present invention, the directional indications are merely used to explain the relative positional relationship, movement conditions, etc. between the components in a specific posture (as shown in the drawings), and if the specific posture is changed, the directional indications are correspondingly changed.
In the prior art, image segmentation performance is poor and segmentation efficiency is low.
In order to solve the problems in the prior art, the embodiment provides an interactive image segmentation method based on light-weight adapter fine adjustment, and the method can realize cross-domain operation on image data and improve the performance and efficiency of algorithm segmentation. When the method is implemented, firstly, an input image and a click image are obtained, and image segmentation is carried out on the input image to obtain a plurality of non-overlapping image blocks; inputting the click image and the image block into a Plain ViT backbone network for image feature fusion to obtain a multi-scale feature map of the input image; further inputting the input image into a self-adaptive domain feature extractor for image feature extraction to obtain a high-quality feature map of the input image; acquiring a history prediction image and mask features, and downsampling the history prediction image to obtain a thumbnail of the history prediction image; and finally, inputting the historical predicted image thumbnail, the mask feature, the multi-scale feature map and the high-quality feature map into a ViT backbone network adapter for image decoding to obtain a predicted image. The self-adaptive domain feature extraction network is adopted in the application and combined with the ViT backbone network, so that the image features are better extracted, better cross-domain performance is realized, and better segmentation effect and segmentation efficiency can be obtained.
For example, the terminal device acquires an input image and a click image of a user, and the system performs preliminary image segmentation on the input image to obtain a plurality of non-overlapping image blocks; the obtained image blocks and the click image are then input into a Plain ViT backbone network for image feature fusion to obtain a multi-scale feature map; the input image is input into a self-adaptive domain feature extractor for image feature extraction to obtain a high-quality feature map of the input image; a previous history prediction image and mask features are acquired, and the history prediction image is downsampled to obtain a thumbnail of the history prediction image; finally, the history prediction image thumbnail, the mask features, the multi-scale feature map and the high-quality feature map obtained in the previous steps are input together into a ViT backbone network adapter for image decoding, and the predicted image is obtained. In addition, the predicted image and the multi-scale feature map can be subjected to logarithmic summation to obtain a corrected output image. By adopting the self-adaptive domain feature extraction network and combining it with the ViT backbone network, the image features are better extracted, better cross-domain performance is realized, and a better segmentation effect and higher segmentation efficiency can be obtained.
Exemplary method
The embodiment provides an interactive image segmentation method based on lightweight adapter fine-tuning, which can be applied to a terminal device. As shown in fig. 1, the method includes:
Step S100, an input image and a click image are obtained, and image segmentation is carried out on the input image, so that a plurality of non-overlapping image blocks are obtained.
In this embodiment, the inputs to the network include the input image x ∈ R^(B×3×H×W) and the click image x_c ∈ R^(B×2×H×W), where the click image x_c comprises two channels that respectively represent the positions of positive clicks and negative clicks, B denotes the number of training samples, H denotes the height of the input image, and W denotes the width of the input image. x_p denotes the distribution of target-object positions, i.e. the output of the initial segmentation of the segmentation network, giving the probability that a target is present at each location, and x_m denotes the final segmentation mask, obtained by thresholding x_p so that each value is True or False.
In specific implementation, the input image and the click image are acquired first, then the input image is subjected to image segmentation, and the input image is segmented into a series of non-overlapping image blocks with the size of 16×16 so as to be input into a Patch embedding module for feature mapping.
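For illustration only, the following sketch shows one way this patch splitting step could be implemented; it assumes a PyTorch-style pipeline, and the tensor sizes in the example are placeholders rather than values taken from this disclosure (only the 16×16 non-overlapping patch size follows the description above).

```python
import torch

def split_into_patches(x: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split an image batch (B, C, H, W) into non-overlapping patches.

    Returns a tensor of shape (B, L, C * patch_size * patch_size), where
    L = (H / patch_size) * (W / patch_size); H and W are assumed to be
    multiples of patch_size.
    """
    # unfold extracts sliding blocks; with stride equal to the kernel size
    # the blocks do not overlap
    patches = torch.nn.functional.unfold(x, kernel_size=patch_size, stride=patch_size)
    # (B, C*p*p, L) -> (B, L, C*p*p)
    return patches.transpose(1, 2)

# Example: a batch of 2 RGB images of size 448x448 yields 28*28 = 784 patches each
x = torch.randn(2, 3, 448, 448)
print(split_into_patches(x).shape)  # torch.Size([2, 784, 768])
```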
Step S200, inputting the click image and the image block into a Plain ViT backbone network for image feature fusion to obtain a multi-scale feature map of the input image.
In this embodiment, the Plain ViT backbone network includes a first Patch embedding module, a second Patch embedding module, N Transformer encoder modules, and a feature pyramid module.
In implementation, as shown in fig. 2, the click image is input into the first Patch embedding module for feature mapping to obtain a click image feature sequence, and the image blocks obtained by segmenting the input image are input into the second Patch embedding module for feature mapping to obtain an image block feature sequence; through the linear transformation of the two Patch embedding modules, the features of each image block are mapped into feature vectors of dimension C, forming a feature sequence of length L. The click image feature sequence and the image block feature sequence are then input together into the N stacked Transformer encoder modules for image encoding to obtain a global feature image of the input image; each Transformer encoder module contains a multi-head self-attention layer, the feature vector sequence is updated after passing through each of the N layers, and the output after the last layer contains the strongest global features of the image. Finally, the global feature image is input into the feature pyramid module for image feature fusion to obtain the multi-scale feature map; the feature pyramid module is composed of a group of convolution and deconvolution layers, and by setting different convolution strides it produces feature maps at several scales relative to the original image size. This multi-scale processing allows the algorithm to better handle objects of different sizes in the input image and, by learning object representations at different scales, achieves more robust multi-scale object segmentation performance.
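As a non-authoritative illustration of the flow described above, the sketch below (PyTorch assumed; the embedding dimension, the number of encoder blocks, the pyramid strides and all module names are illustrative assumptions, not details taken from this disclosure) maps the click image and the image patches to a shared feature sequence with two patch embeddings, encodes it with stacked Transformer blocks, and expands the result into a simple multi-scale feature pyramid built from convolution and deconvolution.

```python
import torch
import torch.nn as nn

class PlainViTBackboneSketch(nn.Module):
    """Illustrative Plain ViT backbone: two Patch embeddings, N encoder blocks,
    and a simple feature pyramid built from convolution / deconvolution."""

    def __init__(self, img_ch=3, click_ch=2, patch=16, dim=768, depth=12, heads=12):
        super().__init__()
        # Patch embedding realised as strided convolutions (one per input type)
        self.patch_embed_img = nn.Conv2d(img_ch, dim, kernel_size=patch, stride=patch)
        self.patch_embed_click = nn.Conv2d(click_ch, dim, kernel_size=patch, stride=patch)
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=depth)
        # Feature pyramid: convolutions / deconvolutions with different strides
        self.up4 = nn.Sequential(nn.ConvTranspose2d(dim, dim // 4, 2, stride=2),
                                 nn.ConvTranspose2d(dim // 4, dim // 4, 2, stride=2))
        self.up2 = nn.ConvTranspose2d(dim, dim // 2, 2, stride=2)
        self.same = nn.Identity()
        self.down2 = nn.Conv2d(dim, dim, 3, stride=2, padding=1)

    def forward(self, image, click):
        # Sum the two embeddings so click cues are fused with image content
        tok = self.patch_embed_img(image) + self.patch_embed_click(click)  # (B, C, H/16, W/16)
        B, C, h, w = tok.shape
        seq = tok.flatten(2).transpose(1, 2)                 # (B, L, C)
        seq = self.encoder(seq)                              # global features
        fmap = seq.transpose(1, 2).reshape(B, C, h, w)
        # Multi-scale maps at e.g. 1/4, 1/8, 1/16 and 1/32 of the input resolution
        return [self.up4(fmap), self.up2(fmap), self.same(fmap), self.down2(fmap)]
```

In this sketch the two embeddings are simply summed; the disclosure only states that the two feature sequences are input together into the encoder modules, so the exact way they are combined is an assumption.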
Step S300, inputting the input image into an adaptive domain feature extractor for image feature extraction, and obtaining a high-quality feature map of the input image.
In this embodiment, the adaptive domain feature extractor includes a spatial prior module, an injection module, a precision feature extraction module, and a feature compression module. The spatial prior module comprises a trunk layer, a convolution filter layer and a full connection layer, and captures local pixel associations and the spatial context of the image through deformation convolution. The trunk layer comprises three deformation convolution layers and a maximum pooling layer, and is used for capturing the spatial features of the input image. Batch normalization layers and ReLU activation functions are used between the convolution layers to improve generalization capability. The convolution filter layer is used to double the number of feature channels and reduce the size of the input image. Image features are extracted from the input image by the adaptive domain feature extractor so as to obtain a high-quality feature map of the input image.
In practice, in order to obtain more accurate segmentation results, rich global semantic context information and local edge detail information are both required, so an adaptive domain feature extractor is introduced to obtain additional spatial prior information and edge detail information; a specific structure of the adaptive domain feature extractor is shown in fig. 3. First, the input image is normalized to obtain a normalized image x_in. The normalized image is then input into the spatial prior module for image detail extraction to obtain the output image of the spatial prior module; the output image of the spatial prior module and the global feature image are input into the injection module for image feature fusion to obtain a feature fusion map; the feature fusion map is input into the precision feature extraction module to extract image precision features and obtain a high-precision feature map; and the high-precision feature map is input into the feature compression module to compress the image features and obtain the high-quality feature map. In the spatial prior module, the input image first passes through the trunk layer, which applies three deformation convolution layers and a maximum pooling layer; the trunk layer reduces the spatial scale of the image and maps its channels to a hidden dimension. The output of the trunk layer is then input into the convolution filter layer, which further doubles the number of feature channels and reduces the size of the image; the convolution filter layer is formed by cascaded deformation convolutions with a stride of 2 and 3×3 convolution kernels. Finally, the output of the convolution filter layer is input into the full connection layer, which projects the feature map to the ViT feature dimension C and yields the output image of the spatial prior module, denoted F_sp in what follows; the full connection layer is a deformation convolution with a 1×1 convolution kernel. After the input image has been processed by the spatial prior module, the injection module is used to fuse the output image of the spatial prior module with the global feature map output by the Transformer encoder modules. Specifically, the spatial prior feature F_sp is fused, via cross-attention, with features from early layers of the Plain ViT backbone network. Studies of ViT show that later blocks of the backbone network have longer attention distances while earlier blocks are more localized and contain semantically lower-level detail features; the output of an early attention block of the ViT backbone network is therefore extracted as the feature, specifically the output of the 3rd of the 12 blocks, denoted F_vit in what follows. Taking F_vit as the query and the spatial prior feature F_sp as the key and value, cross-attention is used to introduce the spatial features into the early ViT feature, as shown in equation (1):

F_inj = F_vit + γ · A(Norm(F_vit), Norm(F_sp))    (1)
where A(·, ·) denotes the cross-attention layer, whose first argument serves as the query and whose second argument serves as the key and value, Norm(·) denotes the LayerNorm layer, and γ is a learnable parameter that weights the output of the attention layer in the residual connection. γ has an initial value of 0; since the attention layer output is non-zero, γ is optimized to a non-zero value in the first iteration of gradient descent, and in this way the fusion of the adaptive domain module and the ViT backbone network is gradually balanced in a learnable manner. After the spatial features have been introduced into the early ViT feature, a precision feature extraction module is applied to extract the high-precision features; the precision feature extraction module consists of a feed-forward network (FFN) and a cross-attention layer, and the process of extracting the high-precision features is shown in formulas (2) and (3):

F_ffn = FFN(Norm(F))    (2)
F_hp = A(Norm(F_ffn), Norm(K))    (3)
where F denotes the feature output by the preceding layer, the output F_ffn of the feed-forward network serves as the query of the cross-attention layer, and K denotes the features that serve as the keys and values of the cross-attention layer. Finally, a feature compression module is designed: the output F_hp of the cross-attention layer is compressed by two deconvolution layers, which enlarge the feature map, to obtain the final high-quality feature map.
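A compressed, non-authoritative sketch of this adaptive domain feature extractor is given below. It only mirrors the overall flow described above (spatial prior module, injection by cross-attention with a zero-initialised learnable γ as in the reconstruction of equation (1), FFN plus cross-attention for precision feature extraction, and two deconvolutions for compression); plain convolutions stand in for the deformation convolutions, nn.MultiheadAttention stands in for the cross-attention layers, and all channel counts and strides are placeholders.

```python
import torch
import torch.nn as nn

class AdaptiveDomainExtractorSketch(nn.Module):
    """Illustrative adaptive domain feature extractor (flow only, sizes are placeholders)."""

    def __init__(self, dim=768, hidden=64, heads=8):
        super().__init__()
        # Spatial prior trunk: plain convs stand in for deformation convs, plus max pooling
        self.trunk = nn.Sequential(
            nn.Conv2d(3, hidden, 3, stride=2, padding=1), nn.BatchNorm2d(hidden), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.BatchNorm2d(hidden), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.BatchNorm2d(hidden), nn.ReLU(),
            nn.MaxPool2d(2))
        # Convolution filter layer: doubles channels, halves spatial size
        self.filter = nn.Conv2d(hidden, hidden * 2, 3, stride=2, padding=1)
        # "Full connection" projection to the ViT feature dimension C (1x1 convolution)
        self.proj = nn.Conv2d(hidden * 2, dim, 1)
        # Injection: cross-attention with a zero-initialised learnable scale gamma
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.inject_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(dim))
        # Precision feature extraction: feed-forward network + cross-attention
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))
        self.extract_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Feature compression: two deconvolution layers enlarging the feature map
        self.compress = nn.Sequential(nn.ConvTranspose2d(dim, dim // 2, 2, stride=2),
                                      nn.ConvTranspose2d(dim // 2, dim // 4, 2, stride=2))

    def forward(self, x_in, vit_early):
        # x_in: normalised input image; vit_early: early ViT block output, shape (B, L, dim)
        sp = self.proj(self.filter(self.trunk(x_in)))        # spatial prior features
        B, C, h, w = sp.shape
        sp_seq = sp.flatten(2).transpose(1, 2)               # (B, h*w, dim)
        # Injection (cf. reconstructed equation (1)): ViT feature as query, prior as key/value
        injected, _ = self.inject_attn(self.norm_q(vit_early),
                                       self.norm_kv(sp_seq), self.norm_kv(sp_seq))
        fused = vit_early + self.gamma * injected
        # Precision feature extraction: FFN output queries the fused features (assumption)
        q = self.ffn(sp_seq) + sp_seq
        hp, _ = self.extract_attn(q, fused, fused)            # high-precision features
        hp_map = hp.transpose(1, 2).reshape(B, C, h, w)
        return self.compress(hp_map)                          # high-quality feature map
```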
Step S400, acquiring a history prediction image and mask features, and downsampling the history prediction image to obtain a thumbnail of the history prediction image.
In the present embodiment, the history prediction image is the prediction image produced at the previous interaction, where t denotes the index of the current click; after the history prediction image is acquired, it is downsampled to obtain the history prediction image thumbnail.
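As an illustration (PyTorch assumed; the variable name and the scale factor are placeholders, not values from this disclosure), the downsampling of the history prediction image can be performed with a single interpolation call:

```python
import torch.nn.functional as F

# prev_pred: previous prediction image of shape (B, 1, H, W) (hypothetical variable name);
# the scale factor below is a placeholder.
thumbnail = F.interpolate(prev_pred, scale_factor=0.25, mode="bilinear", align_corners=False)
```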
Step S500, inputting the thumbnail of the history prediction image, the mask features, the multi-scale feature map and the high-quality feature map into a ViT backbone network adapter for image decoding to obtain the predicted image.
In this embodiment, the ViT backbone network adapter includes a dense input embedding module, a multi-scale feature fusion module, and a bi-directional attention module. The dense input embedding module comprises four two-dimensional convolution layers, a LayerNorm layer and an activation function; it feeds the mask features into the ViT backbone network adapter to enhance the stability of the model's prediction results and to prevent the prediction mask from changing greatly between two consecutive clicks. The dense input embedding module reduces the spatial size of the mask features to a fraction of the original size and maps them to the feature dimension C; its output is combined with the image features obtained by the Plain ViT backbone network and input into the bi-directional attention module for decoding.
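A minimal sketch of a dense input embedding module of the kind described above follows (PyTorch assumed; the channel counts, strides and the GELU activation are illustrative assumptions; only the structure of four two-dimensional convolutions, a LayerNorm and an activation follows the description).

```python
import torch.nn as nn

class LayerNorm2d(nn.Module):
    """LayerNorm applied over the channel dimension of a (B, C, H, W) tensor."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
    def forward(self, x):
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

class DenseInputEmbeddingSketch(nn.Module):
    """Four 2D convolutions + LayerNorm + activation; reduces the mask-feature size
    and maps it to the feature dimension C (illustrative sizes)."""
    def __init__(self, in_ch=1, dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim // 8, 2, stride=2),
            nn.Conv2d(dim // 8, dim // 4, 2, stride=2),
            nn.Conv2d(dim // 4, dim // 2, 2, stride=2),
            nn.Conv2d(dim // 2, dim, 2, stride=2),
            LayerNorm2d(dim),
            nn.GELU())
    def forward(self, mask_feat):
        return self.net(mask_feat)
```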
In implementation, an efficient Token learning scheme is introduced into the ViT backbone network adapter to improve its ability to learn high-quality domain information and prediction masks. As shown in fig. 4, the Token is first decoded by two decoder layers: in each decoding layer, the Token is first updated by self-attention and then its features are updated by bidirectional Token-to-image and image-to-Token cross-attention layers; after passing through the decoder layers, the output Token is associated with the global image context features. Finally, a three-layer MLP is added, and the updated output Token predicts dynamic MLP weights from which dynamic convolution kernels are generated. Furthermore, a multi-scale feature fusion module is introduced into the ViT backbone network adapter: the multi-scale features obtained by the simple feature pyramid network in the Plain ViT backbone network are fused and then input into the ViT backbone network adapter for image decoding. According to studies of ViT (Touvron et al., 2021), a simple feature pyramid structure can effectively extract vision-specific inductive bias from the backbone network. A multi-scale feature fusion layer is therefore designed in the ViT backbone network adapter: features of different scales are compressed by deconvolution layers with different strides, so that the feature dimensions are compressed and the feature sizes are unified with respect to the input image size; finally, the compressed multi-scale features are concatenated along the feature dimension to form the output multi-scale feature. A sketch of such a fusion layer follows this paragraph.
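The multi-scale feature fusion layer can be sketched as follows (PyTorch assumed; the channel counts, strides and the assumed pyramid scales of 1/4, 1/8, 1/16 and 1/32 of the input size are placeholders): each scale is brought to a common channel width and spatial size by a deconvolution with an appropriate stride, and the results are concatenated along the channel dimension.

```python
import torch
import torch.nn as nn

class MultiScaleFusionSketch(nn.Module):
    """Bring each pyramid scale to a common channel width and spatial size with
    deconvolutions of different strides, then concatenate along the channel axis."""

    def __init__(self, dim=768, out_ch=64):
        super().__init__()
        # One deconvolution per scale, with strides chosen so that all outputs
        # land on the same spatial resolution (here 1/2 of the input size).
        self.up = nn.ModuleList([
            nn.ConvTranspose2d(dim // 4, out_ch, 2, stride=2),    # from 1/4 scale
            nn.ConvTranspose2d(dim // 2, out_ch, 4, stride=4),    # from 1/8 scale
            nn.ConvTranspose2d(dim, out_ch, 8, stride=8),         # from 1/16 scale
            nn.ConvTranspose2d(dim, out_ch, 16, stride=16)])      # from 1/32 scale

    def forward(self, feats):
        # feats: list of pyramid maps ordered from the finest to the coarsest scale
        outs = [up(f) for up, f in zip(self.up, feats)]
        return torch.cat(outs, dim=1)    # (B, out_ch * 4, H/2, W/2)
```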
In one implementation, the mask feature is input to the bi-directional attention module using the dense input embedding module; inputting the multi-scale feature map into the multi-scale feature fusion module for feature fusion, and inputting the fused image into the bidirectional attention module; inputting the historical predicted image thumbnail and the high quality feature map into the bi-directional attention module; and adopting the bidirectional attention module to carry out image decoding on the historical predicted image thumbnail, the mask feature, the multi-scale feature map and the high-quality feature map to obtain the predicted image. In the model training process, we fix model parameters of the pre-trained ViT model while only making the parameters of the proposed domain adaptive adapter learnable, so that the learnable parameters only include the convolution layer and cross-attention layer in the adaptive domain feature extractor, the output Token in the adapter, the three-layer MLP associated therewith, the downsampled convolution layer of the dense input embedding module, and the two-layer decoder layer.
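The parameter-efficient training setup described above can be illustrated as follows (PyTorch assumed; the attribute names for the adapter components are hypothetical): all pre-trained ViT parameters are frozen and only the adapter parameters remain learnable.

```python
import torch.nn as nn

def freeze_backbone_train_adapter(model: nn.Module):
    """Freeze all parameters, then re-enable gradients only for the adapter parts.
    The attribute names below are hypothetical; they stand for the adaptive domain
    feature extractor, the dense input embedding module, the adapter decoder layers,
    and the MLP head associated with the output Token."""
    for p in model.parameters():
        p.requires_grad = False
    for name in ("domain_extractor", "dense_embed", "adapter_decoder", "mlp_head"):
        sub = getattr(model, name, None)
        if isinstance(sub, nn.Module):
            for p in sub.parameters():
                p.requires_grad = True
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {trainable} / {total}")
```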
In one implementation, after inputting the thumbnail of the history prediction image, the mask feature, the multi-scale fusion feature map, and the high-quality feature map into the ViT backbone network adapter for image decoding, the method further comprises: carrying out logarithmic summation on the predicted image and the multi-scale feature map to obtain a corrected output image. During inference, the prediction from the adapter is used as the high-quality prediction result; the output of the ViT through the ordinary semantic segmentation head and the adapter's high-quality mask prediction are combined by logarithmic summation, thereby correcting the output prediction mask, and the corrected result is upsampled to obtain the final output.
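One possible reading of the logarithmic summation described above is sketched below (PyTorch assumed; variable names are illustrative, and the use of log-sigmoid probabilities is an assumption about how the summation is realised): the adapter's mask prediction and the output of the ordinary semantic segmentation head are combined by adding their log-probabilities, and the result is upsampled to the target resolution.

```python
import torch.nn.functional as F

def corrected_output(adapter_logits, seg_head_logits, out_size):
    """Combine the adapter prediction and the ordinary segmentation-head prediction
    by summing their log-probabilities, then upsample to the target resolution.
    Both inputs are assumed to share the same spatial size."""
    log_sum = F.logsigmoid(adapter_logits) + F.logsigmoid(seg_head_logits)
    return F.interpolate(log_sum, size=out_size, mode="bilinear", align_corners=False)
```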
To verify the effectiveness of the present invention, we evaluated it on multiple benchmarks, and the experimental results are shown in Table 1. As can be seen from Table 1, the present invention shows the best performance among the compared algorithms. By adopting the scheme of the invention, interactive segmentation efficiency is effectively improved. Overall, the interactive segmentation algorithm provided by the invention achieves leading performance.
Table 1. Comparison of the performance of the invention with prior-art results
In the invention, the key technical feature is improving the algorithm's performance on cross-domain data sets, including performance on fine-grained segmentation tasks. On this technical feature, the present invention exhibits a better effect than the reference algorithm, as shown in fig. 5. As can be seen from fig. 5, the algorithm provided by the present invention can effectively improve performance on fine target segmentation tasks, which illustrates the effectiveness of the proposed scheme. By using the domain adaptive network to learn the domain characteristics of cross-domain data, including image texture details, color style and the like, the domain adaptive model provided by the invention can learn representations of a specific domain style well and achieve efficient segmentation annotation on cross-domain data. The domain adaptive model is deployed in the decoder head and is also included in the image feature extraction network.
Exemplary System
As shown in fig. 6, an embodiment of the present invention provides an interactive image segmentation system based on lightweight adapter trimming, the system comprising: a first image acquisition module 10, a feature fusion module 20, a feature extraction module 30, a second image acquisition module 40, and an image decoding module 50. Specifically, the first image obtaining module 10 is configured to obtain an input image and a click image, and perform image segmentation on the input image to obtain a plurality of non-overlapping image blocks. The feature fusion module 20 is configured to input the click image and the image block into a Plain ViT backbone network for image feature fusion, so as to obtain a multi-scale feature map of the input image. The feature extraction module 30 is configured to input the input image into an adaptive domain feature extractor for image feature extraction, so as to obtain a high-quality feature map of the input image. The second image obtaining module 40 is configured to obtain a history prediction image and a mask feature, and downsample the history prediction image to obtain the history prediction image thumbnail. The image decoding module 50 is configured to input the historical predicted image thumbnail, the mask feature, the multi-scale fusion feature map and the high-quality feature map into a ViT backbone network adapter for image decoding, so as to obtain a predicted image.
In one implementation, the feature fusion module includes:
the click image feature sequence acquisition unit is used for inputting the click image into the first Patch embedding module for feature mapping to obtain a click image feature sequence;
the image block feature sequence obtaining unit is used for inputting the image block into the second Patch embedding module for feature mapping to obtain an image block feature sequence;
the global feature image acquisition unit is used for sequentially inputting the click image feature sequence and the image block feature sequence into the N transform encoder modules to perform image encoding so as to obtain a global feature image of the input image;
the multi-scale feature map acquisition unit is used for inputting the global feature image into the feature pyramid module to perform image feature fusion, so as to obtain a multi-scale feature map.
In one implementation, the feature extraction module includes:
a normalization unit for normalizing the input image;
the image detail extraction unit is used for inputting the standardized input image into the space prior module to extract image details so as to obtain an output image of the space prior module;
The image feature fusion unit is used for inputting the output image of the space prior module and the global feature image into the injection module to perform image feature fusion to obtain a feature fusion image;
the image precision feature extraction unit is used for inputting the feature fusion map into the precision feature extraction module to extract image precision features so as to obtain a high-precision feature map;
and the image feature compression unit is used for inputting the high-precision feature map into the feature compression module to compress the image features so as to obtain a high-quality feature map.
In one implementation, the image decoding module includes:
a first input unit for inputting the mask feature into the bi-directional attention module using the dense input embedding module;
the second input unit is used for inputting the multi-scale feature map into the multi-scale feature fusion module to perform feature fusion, and inputting the fused image into the bidirectional attention module;
a third input unit for inputting the history prediction image thumbnail and the high-quality feature map into the bidirectional attention module;
and the decoding unit is used for carrying out image decoding on the historical predicted image thumbnail, the mask feature, the multi-scale feature map and the high-quality feature map by adopting the bidirectional attention module to obtain the predicted image.
In one implementation, the system further comprises:
and the image correction module is used for carrying out logarithmic summation on the predicted image and the multi-scale feature map to obtain a corrected output image.
In one implementation, the spatial prior module includes a trunk layer, a convolution filter layer, and a full connection layer; the trunk layer comprises three deformation convolution layers and a maximum pooling layer; the trunk layer is used to capture spatial features of the input image, and the convolution filter layer is used to double the number of feature channels and reduce the size of the input image.
In one implementation, the dense input embedding module includes four two-dimensional convolution layers, a LayerNorm layer, and an activation function.
Based on the above embodiments, the present invention further provides a terminal device, a schematic block diagram of which is shown in fig. 7. The terminal device may comprise one or more processors 100 (only one is shown in fig. 7), a memory 101 and a computer program 102 stored in the memory 101 and executable on the one or more processors 100, for example a program implementing the interactive image segmentation method based on lightweight adapter fine-tuning. When the one or more processors 100 execute the computer program 102, the steps in the embodiments of the interactive image segmentation method based on lightweight adapter fine-tuning may be implemented. Alternatively, when the one or more processors 100 execute the computer program 102, the functions of the modules/units in the embodiments of the interactive image segmentation system based on lightweight adapter fine-tuning may be implemented, without limitation.
In one embodiment, the processor 100 may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In one embodiment, the memory 101 may be an internal storage unit of the electronic device, such as a hard disk or a memory of the electronic device. The memory 101 may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (flash card) or the like, which are provided on the electronic device. Further, the memory 101 may also include both an internal storage unit and an external storage device of the electronic device. The memory 101 is used to store computer programs and other programs and data required by the terminal device. The memory 101 may also be used to temporarily store data that has been output or is to be output.
It will be appreciated by those skilled in the art that the functional block diagram shown in fig. 7 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the terminal device to which the present inventive arrangements are applied, and that a particular terminal device may include more or less components than those shown, or may combine some of the components, or have a different arrangement of components.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by a computer program, which may be stored on a non-transitory computer-readable storage medium and which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
In summary, the invention discloses an interactive image segmentation method based on lightweight adapter fine-tuning, which comprises the steps of firstly acquiring an input image and a click image, and carrying out image segmentation on the input image to obtain a plurality of non-overlapping image blocks; inputting the click image and the image blocks into a Plain ViT backbone network for image feature fusion to obtain a multi-scale feature map of the input image; further inputting the input image into a self-adaptive domain feature extractor for image feature extraction to obtain a high-quality feature map of the input image; acquiring a history prediction image and mask features, and downsampling the history prediction image to obtain a thumbnail of the history prediction image; and finally, inputting the thumbnail of the history prediction image, the mask features, the multi-scale feature map and the high-quality feature map into a ViT backbone network adapter for image decoding to obtain the predicted image. The self-adaptive domain feature extraction network is adopted in the application and combined with the ViT backbone network, so that the image features are better extracted, better cross-domain performance is realized, and a better segmentation effect and higher segmentation efficiency can be obtained.
It is to be understood that the invention is not limited in its application to the examples described above, but is capable of modification and variation in light of the above teachings by those skilled in the art, and that all such modifications and variations are intended to be included within the scope of the appended claims.

Claims (10)

1. An interactive image segmentation method based on light-weight adapter fine adjustment, which is characterized by comprising the following steps:
acquiring an input image and a click image, and performing image segmentation on the input image to obtain a plurality of non-overlapping image blocks;
inputting the click image and the image block into a Plain ViT backbone network for image feature fusion to obtain a multi-scale feature map of the input image;
inputting the input image into a self-adaptive domain feature extractor for image feature extraction to obtain a high-quality feature map of the input image;
acquiring a history prediction image and mask features, and downsampling the history prediction image to obtain a thumbnail of the history prediction image;
and inputting the historical predicted image thumbnail, the mask feature, the multi-scale feature map and the high-quality feature map into a ViT backbone network adapter for image decoding to obtain a predicted image.
2. The interactive image segmentation method based on light-weight adapter fine tuning of claim 1, wherein the Plain ViT backbone network comprises a first Patch embedding module, a second Patch embedding module, N Transformer encoder modules and a feature pyramid module, and the inputting the click image and the image block into the Plain ViT backbone network for image feature fusion to obtain a multi-scale feature map of the input image comprises:
inputting the click image into the first Patch embedding module for feature mapping to obtain a click image feature sequence;
inputting the image block into the second Patch embedding module for feature mapping to obtain an image block feature sequence;
sequentially inputting the click image feature sequence and the image block feature sequence into the N Transformer encoder modules to perform image encoding to obtain a global feature image of the input image;
and inputting the global feature image into the feature pyramid module to perform image feature fusion to obtain a multi-scale feature map.
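By way of example and not limitation, the Plain ViT backbone network of claim 2 might be sketched as follows. The class name, the channel and depth choices, the two-channel click image, the additive fusion of the two Patch-embedded sequences and the interpolation-based feature pyramid are assumptions made for illustration, not details taken from the claim.

    # Hedged sketch of a Plain ViT backbone: two Patch embeddings, N Transformer encoders,
    # and a simple feature pyramid built by rescaling the single-scale ViT feature map.
    import torch.nn as nn
    import torch.nn.functional as F

    class PlainViTBackbone(nn.Module):
        def __init__(self, dim=768, depth=12, patch=16, scales=(4.0, 2.0, 1.0, 0.5)):
            super().__init__()
            self.patch_embed_click = nn.Conv2d(2, dim, patch, stride=patch)  # first Patch embedding module
            self.patch_embed_image = nn.Conv2d(3, dim, patch, stride=patch)  # second Patch embedding module
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
            self.encoders = nn.TransformerEncoder(layer, num_layers=depth)   # N Transformer encoder modules
            self.pyramid = nn.ModuleList([nn.Conv2d(dim, 256, 1) for _ in scales])  # feature pyramid module
            self.scales = scales

        def forward(self, image, click_map):
            x = self.patch_embed_image(image) + self.patch_embed_click(click_map)  # fusion by addition (assumed)
            b, c, h, w = x.shape
            tokens = self.encoders(x.flatten(2).transpose(1, 2))       # (B, H*W, C)
            global_feat = tokens.transpose(1, 2).reshape(b, c, h, w)   # global feature image
            multi_scale = [proj(F.interpolate(global_feat, scale_factor=s, mode="bilinear",
                                              align_corners=False))
                           for proj, s in zip(self.pyramid, self.scales)]
            return multi_scale, global_feat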
3. The interactive image segmentation method based on lightweight adapter fine-tuning according to claim 2, wherein the adaptive domain feature extractor comprises a spatial prior module, an injection module, a precision feature extraction module and a feature compression module, and wherein inputting the input image into the adaptive domain feature extractor for image feature extraction to obtain the high-quality feature map of the input image comprises:
standardizing the input image;
inputting the standardized input image into the spatial prior module for extracting image details to obtain an output image of the spatial prior module;
inputting the output image of the spatial prior module and the global feature image into the injection module for image feature fusion to obtain a feature fusion map;
inputting the feature fusion map into the precision feature extraction module to extract image precision features, so as to obtain a high-precision feature map;
and inputting the high-precision feature map into the feature compression module to compress the image features to obtain a high-quality feature map.
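By way of example and not limitation, the adaptive domain feature extractor of claim 3 might be wired as below. Instance normalization for the standardization step, the convolutional stacks and the channel widths are illustrative guesses; a fuller sketch of the spatial prior module appears after claim 6.

    # Hedged sketch of the adaptive domain feature extractor: spatial prior, injection,
    # precision feature extraction and feature compression, with assumed internals.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AdaptiveDomainExtractor(nn.Module):
        def __init__(self, dim=256, vit_dim=768):
            super().__init__()
            self.norm = nn.InstanceNorm2d(3)                       # standardize the input image
            self.spatial_prior = nn.Sequential(                    # captures local image detail
                nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.GELU(),
                nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.GELU(),
                nn.MaxPool2d(2))
            self.inject = nn.Conv2d(dim + vit_dim, dim, 1)         # injection module (fuse with global feature)
            self.refine = nn.Sequential(                           # precision feature extraction module
                nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
                nn.Conv2d(dim, dim, 3, padding=1), nn.GELU())
            self.compress = nn.Conv2d(dim, dim // 2, 1)            # feature compression module

        def forward(self, image, global_feat):
            x = self.spatial_prior(self.norm(image))
            g = F.interpolate(global_feat, size=x.shape[-2:], mode="bilinear", align_corners=False)
            fused = self.inject(torch.cat([x, g], dim=1))          # feature fusion map
            return self.compress(self.refine(fused))               # high-quality feature map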
4. The interactive image segmentation method based on lightweight adapter fine-tuning according to claim 1, wherein the ViT backbone network adapter comprises a dense input embedding module, a multi-scale feature fusion module and a bi-directional attention module, and wherein inputting the historical predicted image thumbnail, the mask feature, the multi-scale feature map and the high-quality feature map into the ViT backbone network adapter for image decoding to obtain the predicted image comprises:
inputting the mask feature into the bi-directional attention module using the dense input embedding module;
inputting the multi-scale feature map into the multi-scale feature fusion module for feature fusion, and inputting the fused image into the bi-directional attention module;
inputting the historical predicted image thumbnail and the high-quality feature map into the bi-directional attention module;
and adopting the bi-directional attention module to carry out image decoding on the historical predicted image thumbnail, the mask feature, the multi-scale feature map and the high-quality feature map to obtain the predicted image.
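By way of example and not limitation, the decoding path of claim 4 might look like the sketch below. The bi-directional attention module is rendered here as generic cross-attention in both directions, and the channel counts, the single-channel thumbnail and mask-feature inputs and the 1x1 projections are assumptions rather than claim features.

    # Hedged sketch of the ViT backbone network adapter: dense input embedding,
    # multi-scale feature fusion and a two-way (bi-directional) attention step.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ViTAdapterDecoder(nn.Module):
        def __init__(self, dim=256, heads=8, levels=4):
            super().__init__()
            self.dense_embed = nn.Conv2d(1, dim, 1)       # stand-in for the dense input embedding module
            self.thumb_proj = nn.Conv2d(1, dim, 1)        # projects the historical prediction thumbnail
            self.fuse = nn.Conv2d(levels * dim, dim, 1)   # multi-scale feature fusion module (assumed)
            self.mask_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.img_attends_mask = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.head = nn.Conv2d(dim, 1, 1)

        def forward(self, thumb, mask_feat, multi_scale, hq_feat):
            size = hq_feat.shape[-2:]
            ms = self.fuse(torch.cat([F.interpolate(m, size=size, mode="bilinear", align_corners=False)
                                      for m in multi_scale], dim=1))
            img = ms + hq_feat + self.thumb_proj(
                F.interpolate(thumb, size=size, mode="bilinear", align_corners=False))
            prm = self.dense_embed(F.interpolate(mask_feat, size=size, mode="bilinear", align_corners=False))
            b, c, h, w = img.shape
            img_tok = img.flatten(2).transpose(1, 2)       # image-side tokens (B, H*W, C)
            prm_tok = prm.flatten(2).transpose(1, 2)       # mask-feature tokens
            prm_tok, _ = self.mask_attends_img(prm_tok, img_tok, img_tok)  # mask tokens attend over image tokens
            img_tok, _ = self.img_attends_mask(img_tok, prm_tok, prm_tok)  # image tokens attend over mask tokens
            out = img_tok.transpose(1, 2).reshape(b, c, h, w)
            return self.head(out)                          # predicted image (logits)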
5. The interactive image segmentation method based on lightweight adapter fine-tuning according to claim 1, wherein after inputting the historical predicted image thumbnail, the mask feature, the multi-scale feature map and the high-quality feature map into the ViT backbone network adapter for image decoding, the method further comprises:
and carrying out logarithmic summation on the predicted image and the multi-scale feature map to obtain a corrected output image.
6. The interactive image segmentation method based on lightweight adapter fine-tuning according to claim 3, wherein the spatial prior module comprises a trunk layer, a convolution filter layer and a fully connected layer;
the trunk layer comprises three deformable convolution layers and a max pooling layer;
the trunk layer is used to capture spatial features of the input image, and the convolution filter layer is used to double the number of feature channels and reduce the size of the input image.
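By way of example and not limitation, the spatial prior module of claim 6 might be sketched as follows, using torchvision's DeformConv2d for the deformable convolution layers with offsets predicted from the input; the channel widths, strides and the way the fully connected layer is applied are assumptions.

    # Hedged sketch of the spatial prior module: a trunk of three deformable convolutions plus
    # max pooling, a convolution filter layer that doubles channels and halves the size,
    # and a fully connected layer applied over the channel dimension.
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class DeformBlock(nn.Module):
        """One deformable convolution whose offsets are predicted from its own input."""
        def __init__(self, cin, cout, k=3, stride=1, pad=1):
            super().__init__()
            self.offset = nn.Conv2d(cin, 2 * k * k, k, stride=stride, padding=pad)
            self.dconv = DeformConv2d(cin, cout, k, stride=stride, padding=pad)

        def forward(self, x):
            return self.dconv(x, self.offset(x))

    class SpatialPriorModule(nn.Module):
        def __init__(self, dim=64, out_features=256):
            super().__init__()
            self.trunk = nn.Sequential(               # trunk layer: three deformable convs + max pooling
                DeformBlock(3, dim, stride=2),
                DeformBlock(dim, dim),
                DeformBlock(dim, dim),
                nn.MaxPool2d(2))
            self.filter = nn.Conv2d(dim, 2 * dim, 3, stride=2, padding=1)  # doubles channels, halves size
            self.fc = nn.Linear(2 * dim, out_features)                     # fully connected layer

        def forward(self, x):
            x = self.filter(self.trunk(x))             # (B, 2*dim, H', W')
            return self.fc(x.permute(0, 2, 3, 1))      # FC applied over the channel dimension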
7. The interactive image segmentation method based on lightweight adapter fine-tuning according to claim 4, wherein the dense input embedding module comprises four two-dimensional convolution layers, a LayerNorm layer and an activation function.
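By way of example and not limitation, the dense input embedding module of claim 7 might be sketched as below; the GELU activation, the channel schedule and the channels-last LayerNorm placement are assumptions, and only the count of four two-dimensional convolutions plus a LayerNorm and an activation comes from the claim.

    # Hedged sketch of the dense input embedding module: four 2-D convolutions,
    # a LayerNorm over the channel dimension and an activation function.
    import torch.nn as nn

    class DenseInputEmbedding(nn.Module):
        def __init__(self, cin=1, dim=256):
            super().__init__()
            self.convs = nn.Sequential(
                nn.Conv2d(cin, dim // 4, 2, stride=2),
                nn.Conv2d(dim // 4, dim // 2, 2, stride=2),
                nn.Conv2d(dim // 2, dim, 2, stride=2),
                nn.Conv2d(dim, dim, 1))
            self.norm = nn.LayerNorm(dim)
            self.act = nn.GELU()

        def forward(self, x):
            x = self.convs(x)                                         # (B, dim, H', W')
            x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # LayerNorm over channels
            return self.act(x)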
8. An interactive image segmentation system based on lightweight adapter fine-tuning, characterized in that the system comprises:
the first image acquisition module is used for acquiring an input image and a click image, and partitioning the input image into a plurality of non-overlapping image blocks;
the feature fusion module is used for inputting the click image and the image block into a Plain ViT backbone network to perform image feature fusion, so as to obtain a multi-scale feature map of the input image;
the feature extraction module is used for inputting the input image into an adaptive domain feature extractor for image feature extraction to obtain a high-quality feature map of the input image;
the second image acquisition module is used for acquiring a historical predicted image and a mask feature, and downsampling the historical predicted image to obtain a thumbnail of the historical predicted image;
and the image decoding module is used for inputting the historical predicted image thumbnail, the mask feature, the multi-scale feature map and the high-quality feature map into a ViT backbone network adapter for image decoding to obtain a predicted image.
9. A terminal device, comprising a memory, a processor, and an interactive image segmentation program based on lightweight adapter fine-tuning that is stored in the memory and executable on the processor, wherein the processor implements the interactive image segmentation method based on lightweight adapter fine-tuning according to any one of claims 1-7 when executing the program.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores an interactive image segmentation program based on lightweight adapter fine-tuning which, when executed by a processor, implements the steps of the interactive image segmentation method based on lightweight adapter fine-tuning according to any one of claims 1-7.
CN202311389484.9A 2023-10-24 2023-10-24 Interactive image segmentation method and system based on light-weight adapter fine adjustment Pending CN117372444A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311389484.9A CN117372444A (en) 2023-10-24 2023-10-24 Interactive image segmentation method and system based on light-weight adapter fine adjustment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311389484.9A CN117372444A (en) 2023-10-24 2023-10-24 Interactive image segmentation method and system based on light-weight adapter fine adjustment

Publications (1)

Publication Number Publication Date
CN117372444A true CN117372444A (en) 2024-01-09

Family

ID=89405549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311389484.9A Pending CN117372444A (en) 2023-10-24 2023-10-24 Interactive image segmentation method and system based on light-weight adapter fine adjustment

Country Status (1)

Country Link
CN (1) CN117372444A (en)

Similar Documents

Publication Publication Date Title
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN110414507B (en) License plate recognition method and device, computer equipment and storage medium
CN110473137B (en) Image processing method and device
CN110245621B (en) Face recognition device, image processing method, feature extraction model, and storage medium
AU2022203080B2 (en) Method for depth estimation for a variable focus camera
CN111899203B (en) Real image generation method based on label graph under unsupervised training and storage medium
CN111340077A (en) Disparity map acquisition method and device based on attention mechanism
CN115147648A (en) Tea shoot identification method based on improved YOLOv5 target detection
CN114037888A (en) Joint attention and adaptive NMS (network management System) -based target detection method and system
CN111738270A (en) Model generation method, device, equipment and readable storage medium
CN111340189A (en) Space pyramid graph convolution network implementation method
CN113963009A (en) Local self-attention image processing method and model based on deformable blocks
CN112598012B (en) Data processing method in neural network model, storage medium and electronic device
CN114998630B (en) Ground-to-air image registration method from coarse to fine
CN117372444A (en) Interactive image segmentation method and system based on light-weight adapter fine adjustment
CN116258873A (en) Position information determining method, training method and device of object recognition model
CN107220651B (en) Method and device for extracting image features
CN113344110B (en) Fuzzy image classification method based on super-resolution reconstruction
CN113051901B (en) Identification card text recognition method, system, medium and electronic terminal
CN114332890A (en) Table structure extraction method and device, electronic equipment and storage medium
CN108334884B (en) Handwritten document retrieval method based on machine learning
CN112614199A (en) Semantic segmentation image conversion method and device, computer equipment and storage medium
CN113139417A (en) Action object tracking method and related equipment
CN117237623B (en) Semantic segmentation method and system for remote sensing image of unmanned aerial vehicle
CN117540043B (en) Three-dimensional model retrieval method and system based on cross-instance and category comparison

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination