CN117372701A - Interactive image segmentation method based on Transformer - Google Patents

Interactive image segmentation method based on Transformer

Info

Publication number
CN117372701A
CN117372701A
Authority
CN
China
Prior art keywords
segmentation
click
image
model
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311667809.5A
Other languages
Chinese (zh)
Other versions
CN117372701B (en)
Inventor
何一凡
陈盼盼
王大寒
江楠峰
吴芸
王驰明
朱顺痣
于金喜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Ruiwei Information Technology Co ltd
Xiamen University of Technology
Original Assignee
Xiamen Ruiwei Information Technology Co ltd
Xiamen University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Ruiwei Information Technology Co ltd, Xiamen University of Technology filed Critical Xiamen Ruiwei Information Technology Co ltd
Priority to CN202311667809.5A priority Critical patent/CN117372701B/en
Publication of CN117372701A publication Critical patent/CN117372701A/en
Application granted granted Critical
Publication of CN117372701B publication Critical patent/CN117372701B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a Transformer-based interactive image segmentation method, which comprises: selecting an image to be annotated and loading it into interactive image segmentation annotation software; selecting a segmentation target, generating a click record from the user's click actions, and generating a click marker at the corresponding position; after the interaction is confirmed, converting the click record into disc maps that serve as the corresponding positive and negative click guidance, concatenating them with the previous mask, and adding the result to the original image as the segmentation model input; segmenting the specified target in the image with a pre-trained segmentation model and returning an initial segmentation mask; adding appropriate positive and negative clicks to re-mark erroneous regions according to the initial segmentation mask result; and feeding the new markers into the segmentation model again and returning the corrected result. The segmentation result is refined in this way until a satisfactory result is obtained. The invention improves interactive image segmentation annotation performance and obtains better segmentation results with fewer interactions.

Description

Interactive image segmentation method based on Transformer
Technical Field
The invention relates to the technical field of computer vision and interactive segmentation annotation, and in particular to a Transformer-based interactive image segmentation method.
Background
In the era of rapid development of large models, deep learning technology has undergone revolutionary progress, mainly benefiting from large-scale, high-performance computing resources and huge data volumes. Large models exhibit excellent capabilities in handling complex tasks and extracting high-level features, and their performance is largely due to the successful application of deep learning methods to large amounts of training data. With the advent of large models, the demand for high-quality training data sets has also grown exponentially in areas such as semantic segmentation, instance segmentation, and salient object detection. However, the annotation of these data sets becomes increasingly cumbersome and expensive, particularly in pixel-level annotation tasks. Interactive image segmentation algorithms therefore show important research and application value, and more and more researchers are exploring this field. Interactive image segmentation algorithms have become the preferred choice for reducing manual annotation cost; their introduction makes annotating training data sets more convenient and efficient, which in turn promotes the rapid development of large models. While pursuing high precision and intelligent operation, these algorithms provide a practical solution to the challenge of large-scale data annotation. Over decades of development, interactive image segmentation technology has accumulated rich theories and methods and formed a relatively complete algorithm system and interaction framework. The main interaction modes include clicks, scribbles, bounding boxes, and outlines. Interactive segmentation methods can be divided into two main categories: traditional methods and deep learning-based methods. Traditional approaches typically exploit low-level features such as pixel values and the relationships between them. Because traditional methods do not train a model, they cannot learn from existing experience, so this type of method requires considerable labor cost to obtain a good segmentation result. With the rapid development of deep learning, convolutional neural networks have replaced traditional methods, and Transformer-based interactive segmentation models in turn exhibit performance superior to convolutional neural networks. Although these algorithms have achieved significant success in interactive image segmentation tasks, existing segmentation methods still have limitations, including interaction modes that are not flexible enough, inaccurate reflection of user intention, insufficient segmentation precision, and the need for many user interactions. Consequently, these interactive image segmentation methods fall short when facing natural images with complex backgrounds and medical images with blurred contours. Based on this, the present invention improves the interactive segmentation model in these respects.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a Transformer-based interactive image segmentation method which improves interactive image segmentation annotation performance, shows good generalization on medical images, and obtains better segmentation results with fewer interactions.
In order to achieve the above object, the solution of the present invention is:
the interactive image segmentation method based on the Transformer comprises the following steps of:
s1, selecting an image to be marked, and loading the image to the interactive image segmentation marking software;
s2, selecting a segmentation target by a user, starting with left click, generating a record according to click behaviors of the user after starting, and generating a click mark at a corresponding position;
s3, after the interaction is confirmed, converting the interaction into a circular graph according to the click record, splicing the circular graph with an original mask as a corresponding positive and negative click guide, and finally adding the circular graph with the original mask as a segmentation model to input;
s4, segmenting an appointed target in the image by utilizing a pre-training segmentation model, and returning to an initial segmentation mask;
s5, selecting to add proper positive and negative clicks to mark the error region again according to the initial segmentation mask result;
and S6, sending the new mark into the segmentation model again, returning a corrected result, and refining the segmentation result to obtain a satisfactory result in a reciprocating manner.
Further, in step S1, the image to be annotated may be of any size; it is loaded into the interactive image segmentation annotation software, and after the interaction is completed the software adaptively resizes the image to the uniform size 448 x 448 to ensure that the size requirement of the segmentation model input is met.
Further, in step S2, after the user clicks the start button in the menu bar, mouse events are continuously monitored; when the user presses the left button at a position, the coordinate information of that position is recorded and a green dot is generated at the corresponding position to indicate that the position lies inside the object to be segmented, i.e. the segmentation target (foreground); when the button is released, one marking operation is finished. Similarly, when the user presses the right button at a position, the same operation is performed, except that a red dot is generated to indicate that the position lies outside the object to be segmented, i.e. the background.
Further, in step S2, the radius of the dots is 5 pixels. When the radius is too large, positive/negative clicks near the foreground edge may be mistaken for negative/positive clicks, introducing ambiguity into the interaction information, so that the model misinterprets them and produces poor segmentation results; when the radius is too small, the area covered by a click is too small to provide rich information, so the user may need to provide more interactions, which runs against the original intent of interactive segmentation.
In step S3, the click records generated in step S2 are regarded as inside/outside areas of the segmentation target, and disc maps with a radius of 5 pixels are generated from the click coordinates to serve as the positive and negative click guidance; for the initial segmentation, an image of size 448 x 448 with all pixel values equal to 0 is generated as the initial segmentation mask, and the positive and negative click guidance and the initial segmentation mask are concatenated into a three-channel image of size 3 x 448 x 448.
Further, the 3-channel original RGB image of size 448 x 448 is added element by element to the 3 x 448 x 448 image formed by the positive and negative clicks and the initial segmentation mask, and the result is used as the segmentation model input.
Further, in step S4, when the segmentation model is trained, the deep learning model needs a large amount of annotated data, easily numbering in the millions; if manual click annotation were adopted, the cost would be too high, so a simulated sampling strategy is used to generate the positive and negative click guidance.
In step S4, the segmentation model is a reconstructed click model (Interactive Segmentation with Reconstruct Click Vision Transformers) using a Transformer as the backbone; the reconstructed click model mainly comprises a reconstructed click image embedding module and a multi-scale adaptive fusion module, which are used to enhance the learning ability of the model and obtain precise segmentation results.
Further, the reconstructed click image embedding module performs feature separation and reconstruction according to the importance of different clicks so as to enhance the feature representation of the important clicks; the specific method is as follows:
first, the contribution of different clicks is evaluated using the scaling factors of the group normalization (Group Normalization) layer; for a given feature map $X \in \mathbb{R}^{N \times C \times H \times W}$, $N$ is the batch size, $C$ is the number of channels, and $H \times W$ is the image size;
the input feature $X$ is first normalized by a simple normalization operation, as shown in formula 1-1:
$X_{out} = GN(X) = \gamma \dfrac{X - \mu}{\sqrt{\sigma^{2} + \varepsilon}} + \beta$    (formula 1-1)
where $\mu$ and $\sigma^{2}$ are the mean and variance respectively, $\gamma$ and $\beta$ are trainable parameters, the magnitude of $\gamma$ reflects the degree of contribution of the spatial pixel information, and $\varepsilon$ is a small positive constant that keeps the division well defined; $GN(\cdot)$ denotes the group normalization operation applied to $X$;
the normalized weights $W_{\gamma}$ represent the importance of the different feature maps, as shown in formula 1-2:
$W_{\gamma_{i}} = \dfrac{\gamma_{i}}{\sum_{j=1}^{C} \gamma_{j}}, \quad i = 1, 2, \dots, C$    (formula 1-2)
where $W_{\gamma_{i}}$ is the normalized weight of the $i$-th channel, $\gamma_{i}$ is the weight of the $i$-th channel, and $\sum_{j=1}^{C} \gamma_{j}$ sums the weights of all channels;
the normalized weights are then mapped into the range 0-1 by a Sigmoid function, and the weights are separated by a gating mechanism (with the threshold set to 0.5): weights greater than or equal to the threshold (0.5) are considered to contribute more and are denoted $W_{1}$, while weights smaller than the threshold (0.5) are considered to contribute less and are denoted $W_{2}$, as shown in formula 1-3:
$W_{1} = \mathrm{Gate}_{\geq 0.5}\big(\mathrm{Sigmoid}(W_{\gamma})\big), \qquad W_{2} = \mathrm{Gate}_{< 0.5}\big(\mathrm{Sigmoid}(W_{\gamma})\big)$    (formula 1-3)
where $\mathrm{Gate}$ is the threshold gating applied to the parameter $W_{\gamma}$, with the default threshold value 0.5; the degree of contribution is distinguished by comparing each weight with the threshold;
then, the input feature $X$ is multiplied by $W_{1}$ and $W_{2}$ to obtain the features with different contribution degrees, $X_{1}$ and $X_{2}$;
finally, the two weighted information features are fused, as shown in formula 1-4:
$X' = X_{1} \oplus X_{2}$    (formula 1-4)
where $\oplus$ denotes the fusion operation.
further, the multi-scale adaptive fusion module learns how to filter the feature space at other levels so as to preserve useful spatial information adaptive combination, and the specific method is as follows:
first, features of different resolutions generated by pyramids are expressed asWherein->Express hierarchy, let->The representation is located at the +.>Feature vector at from level->Resizing to level->In the hierarchy->The fusion pattern of (2) is shown in formulas 1-5:
wherein the method comprises the steps ofRepresenting post-fusion spatial position->Feature vector at>、/>Four different levels to level for model adaptive learning>Spatial importance weight of>And is also provided withAs shown in equations 1-6:
further, in step S5, the segmentation model returns the mask result of the first segmentation to the front end for display, and may be further modified according to the segmentation result, and the positive and negative clicks are added to the wrong place of the segmentation mask for marking, where the vicinity of the edge of the segmented object is marked, and after the marking is completed, the step S3 is repeated to convert the new positive and negative clicks again to generate a positive and negative click guidance.
Further, in step S6, the newly generated positive and negative click guidance and the segmentation mask that was just returned to the front end are subjected to a stitching operation, then added element by element with the original image, and then sent to the model for re-prediction, and finally the corrected segmentation mask is returned to the front end, so that interactive correction is iterated until a satisfactory segmentation mask is obtained.
With the above scheme, the Transformer-based interactive image segmentation method provided by the invention addresses the facts that interactive segmentation models neglect the different contributions of individual clicks to the segmentation result and suffer from inconsistency conflicts caused by multi-scale fusion, and designs a reconstructed click model (Interactive Segmentation with Reconstruct Click Vision Transformers) so that the model can fully exploit the click interaction information and the scale invariance of the features is improved. The method achieves state-of-the-art performance on the interactive image segmentation annotation task and also shows good generalization on medical images. Better segmentation results are obtained with fewer interactions, which gives the method commercial application value.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention designs a brand-new Transformer-based interactive segmentation model.
(2) The invention effectively enables the segmentation model to make full use of the click interaction information and strengthens the influence of the interaction guidance on the segmentation result.
(3) The invention effectively resolves the inconsistency caused by the fusion of multi-scale features and improves the scale invariance of the features.
(4) The Transformer-based interactive image segmentation method provided by the invention shows good generalization performance on medical data sets.
Drawings
Fig. 1 is a flow chart of the interactive image segmentation method based on the Transformer.
Fig. 2 is a model diagram of the Transformer-based interactive image segmentation method of the present invention.
FIG. 3 is a schematic diagram of the present invention regarding the embedding of reconstructed click images.
FIG. 4 is a schematic diagram of the architecture of the present invention with respect to multi-scale adaptive fusion.
Fig. 5 is a schematic diagram of a segmentation result of the Transformer-based interactive image segmentation method of the present invention.
Detailed Description
In order to further explain the technical scheme of the invention, the invention is explained in detail by specific examples.
As shown in fig. 1 to 5, the present invention provides a Transformer-based interactive image segmentation method with which a user interactively annotates segmentation images, comprising the following steps:
and S1, selecting an image to be marked, and loading the image to the interactive image segmentation marking software. The user can select any size of the picture, the picture is loaded into the interactive image segmentation software, and after the interaction is completed, the picture is adaptively adjusted to the uniform size 448 x 448, so that the size requirement of model input is met.
And S2, selecting a segmentation target by a user, starting with left click, generating a record according to the click action of the user after starting, and generating a click mark at a corresponding position.
After the user clicks the start button in the menu bar on the right side, mouse events are continuously monitored. When the user presses the left button at a position, the coordinate information of that position is recorded and a green dot is generated at the corresponding position to indicate that the position lies inside the object to be segmented, i.e. the segmentation target (foreground). When the button is released, one marking operation is finished. Likewise, when the user presses the right button at a position, the same operation is performed, except that a red dot is generated to indicate that the position lies outside the object to be segmented, i.e. the background.
According to experiments and statistics, the radius of the dots is set to 5 pixels. When the radius is too large, positive/negative clicks near the foreground edge can be mistaken for negative/positive clicks, introducing ambiguity into the interaction information, causing the model to misinterpret them and produce poor segmentation results. When the radius is too small, the area covered by a click is too small to provide rich information, so the user may need to provide more interactions, which runs against the original intent of interactive segmentation.
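As a purely illustrative aid (not part of the patented software), the following Python sketch shows one way the click recording described above could be structured; the names ClickRecord and on_mouse_press and the event-callback interface are assumptions.

```python
# Hypothetical sketch of click recording for the annotation front end.
# ClickRecord / on_mouse_press are illustrative names, not the patent's API.
from dataclasses import dataclass
from typing import List

@dataclass
class ClickRecord:
    x: int          # column coordinate of the click in the displayed image
    y: int          # row coordinate of the click in the displayed image
    positive: bool  # True = left click (foreground), False = right click (background)

clicks: List[ClickRecord] = []

def on_mouse_press(x: int, y: int, button: str) -> None:
    """Record one click and its polarity; the UI would draw a green dot for
    positive clicks and a red dot for negative clicks at (x, y)."""
    clicks.append(ClickRecord(x=x, y=y, positive=(button == "left")))
```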
And S3, after the interaction is confirmed, the click records are converted into disc maps that serve as the corresponding positive and negative click guidance, the disc maps are concatenated with the previous mask, and the result is finally added to the original image as the segmentation model input.
The previous click records are regarded as inside/outside areas of the segmentation target, and disc maps with a radius of 5 pixels are generated from the click coordinates to serve as the positive and negative click guidance, respectively. For the initial segmentation, an image of size 448 x 448 with all pixel values equal to 0 is generated as the initial segmentation mask. The positive and negative click guidance and the initial segmentation mask are concatenated into a three-channel image of size 3 x 448 x 448.
The 3-channel original RGB image of size 448 x 448 is added element by element to the 3 x 448 x 448 image formed by the positive and negative clicks and the initial segmentation mask, and the result is used as the segmentation model input.
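To make the input construction concrete, here is a minimal sketch that builds on the ClickRecord structure from the earlier sketch; it assumes a channel-first 3 x 448 x 448 RGB array and 5-pixel click disks, and the function names are assumptions rather than the patent's implementation.

```python
# Illustrative sketch: rasterise clicks as disks, stack them with the previous
# mask into a 3 x 448 x 448 guidance tensor, and add it to the RGB image.
import numpy as np

def render_click_map(clicks, positive: bool, size: int = 448, radius: int = 5) -> np.ndarray:
    """Draw the positive or negative clicks as filled disks of the given radius."""
    canvas = np.zeros((size, size), dtype=np.float32)
    yy, xx = np.mgrid[0:size, 0:size]
    for c in clicks:
        if c.positive == positive:
            canvas[(yy - c.y) ** 2 + (xx - c.x) ** 2 <= radius ** 2] = 1.0
    return canvas

def build_model_input(image_rgb: np.ndarray, clicks, prev_mask=None) -> np.ndarray:
    """image_rgb: float array of shape (3, 448, 448); returns the model input."""
    size = 448
    if prev_mask is None:                      # initial segmentation: all-zero mask
        prev_mask = np.zeros((size, size), dtype=np.float32)
    guidance = np.stack([
        render_click_map(clicks, positive=True, size=size),   # positive click disks
        render_click_map(clicks, positive=False, size=size),  # negative click disks
        prev_mask.astype(np.float32),                         # previous segmentation mask
    ])                                          # shape: (3, 448, 448)
    return image_rgb.astype(np.float32) + guidance             # element-wise addition
```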
And S4, segmenting the specified target in the image using the pre-trained segmentation model, and returning the initial segmentation mask.
When the segmentation model is trained, the deep learning model needs a large amount of annotated data, easily numbering in the millions; if manual click annotation were adopted, the cost would be too high. Therefore, a simulated sampling strategy is employed to generate the positive and negative click guidance.
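The patent does not spell out the simulation procedure here; as an assumption-labelled sketch, one common way to simulate click guidance during training is to draw positive clicks inside the ground-truth mask and negative clicks from the background, for example:

```python
# Hypothetical click simulation for training (the exact sampling strategy of
# the patent is not specified here); positions are sampled from the GT mask.
import numpy as np

def _sample(indices: np.ndarray, n: int, rng) -> list:
    if len(indices) == 0 or n == 0:
        return []
    chosen = rng.choice(len(indices), size=min(n, len(indices)), replace=False)
    return indices[chosen].tolist()

def sample_clicks(gt_mask: np.ndarray, n_pos: int = 3, n_neg: int = 3, seed: int = 0):
    """Return (positive, negative) lists of (row, col) simulated click positions."""
    rng = np.random.default_rng(seed)
    pos = _sample(np.argwhere(gt_mask > 0), n_pos, rng)   # inside the object
    neg = _sample(np.argwhere(gt_mask == 0), n_neg, rng)  # background
    return pos, neg
```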
We designed a reconstructed click model (Interactive Segmentation with Reconstruct Click Vision Transformers) with a Transformer backbone, as shown in fig. 2. The model mainly comprises a reconstructed click image embedding module and a multi-scale adaptive fusion module, which are used to enhance the learning ability of the model and obtain precise segmentation results.
The reconstructed click image embedding module performs feature separation and reconstruction according to the importance of different clicks to enhance the feature representation of important clicks, as shown in fig. 3. First, we evaluate the contribution of different clicks using the scaling factors of the group normalization (Group Normalization) layer. For a given feature map $X \in \mathbb{R}^{N \times C \times H \times W}$, where $N$ is the batch size, $C$ is the number of channels, and $H \times W$ is the image size, we first normalize the input feature $X$, as shown in formula 1-1:
$X_{out} = GN(X) = \gamma \dfrac{X - \mu}{\sqrt{\sigma^{2} + \varepsilon}} + \beta$    (formula 1-1)
where $\mu$ and $\sigma^{2}$ are the mean and variance respectively, $\gamma$ and $\beta$ are trainable parameters, the magnitude of $\gamma$ reflects the degree of contribution of the spatial pixel information, and $\varepsilon$ is a small positive constant that keeps the division well defined; $GN(\cdot)$ denotes the group normalization operation applied to $X$.
The normalized weights $W_{\gamma}$ represent the importance of the different feature maps, as shown in formula 1-2:
$W_{\gamma_{i}} = \dfrac{\gamma_{i}}{\sum_{j=1}^{C} \gamma_{j}}, \quad i = 1, 2, \dots, C$    (formula 1-2)
where $W_{\gamma_{i}}$ is the normalized weight of the $i$-th channel, $\gamma_{i}$ is the weight of the $i$-th channel, and $\sum_{j=1}^{C} \gamma_{j}$ sums the weights of all channels.
The normalized weights are then mapped into the range 0-1 by a Sigmoid function, and the weights are separated by a gating mechanism (threshold set to 0.5): weights greater than or equal to 0.5 are considered to contribute more and are denoted $W_{1}$, while weights smaller than 0.5 are considered to contribute less and are denoted $W_{2}$, as shown in formula 1-3:
$W_{1} = \mathrm{Gate}_{\geq 0.5}\big(\mathrm{Sigmoid}(W_{\gamma})\big), \qquad W_{2} = \mathrm{Gate}_{< 0.5}\big(\mathrm{Sigmoid}(W_{\gamma})\big)$    (formula 1-3)
where $\mathrm{Gate}$ is the threshold gating applied to the parameter $W_{\gamma}$, with the default threshold value 0.5; the degree of contribution is distinguished by comparing each weight with the threshold.
Finally, we multiply the input feature $X$ by $W_{1}$ and $W_{2}$ to obtain the features with different contribution degrees, $X_{1}$ and $X_{2}$, and fuse the two weighted information features, as shown in formula 1-4:
$X' = X_{1} \oplus X_{2}$    (formula 1-4)
where $\oplus$ denotes the fusion operation.
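The following PyTorch sketch is one interpretation of formulas 1-1 to 1-4, offered only as an illustration: the class name, the choice of 16 groups, and the 1x1-convolution fusion of $X_{1}$ and $X_{2}$ (the fusion operator of formula 1-4 is not fixed by the text) are assumptions, and applying the gated weights to the normalized feature rather than the raw input is likewise an interpretation.

```python
# Sketch of the reconstructed click image embedding idea (assumptions noted above).
import torch
import torch.nn as nn

class ReconstructClickEmbedding(nn.Module):
    def __init__(self, channels: int, groups: int = 16, threshold: float = 0.5):
        super().__init__()                      # channels must be divisible by groups
        self.gn = nn.GroupNorm(groups, channels)
        self.threshold = threshold
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)  # placeholder fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (N, C, H, W)
        x_norm = self.gn(x)                                 # formula 1-1: group normalization
        gamma = self.gn.weight                              # trainable scaling factors
        w = gamma / gamma.sum()                             # formula 1-2: channel importance
        w = torch.sigmoid(w).view(1, -1, 1, 1)              # squash weights into (0, 1)
        w1 = (w >= self.threshold).float()                  # formula 1-3: high-contribution gate
        w2 = 1.0 - w1                                       # low-contribution gate
        x1 = x_norm * w1                                    # informative features
        x2 = x_norm * w2                                    # less informative features
        return self.fuse(torch.cat([x1, x2], dim=1))        # formula 1-4: fusion (placeholder)
```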
The multi-scale adaptive fusion module learns how to filter features from other levels so as to adaptively combine and preserve useful spatial information, as shown in fig. 4. First, we denote the features of different resolutions generated by the pyramid as $x^{n}$, where $n$ denotes the level. Let $x_{ij}^{n \to l}$ denote the feature vector at spatial position $(i, j)$ resized from level $n$ to level $l$. The fusion at level $l$ is shown in formula 1-5:
$y_{ij}^{l} = \alpha_{ij}^{l}\, x_{ij}^{1 \to l} + \beta_{ij}^{l}\, x_{ij}^{2 \to l} + \gamma_{ij}^{l}\, x_{ij}^{3 \to l} + \delta_{ij}^{l}\, x_{ij}^{4 \to l}$    (formula 1-5)
where $y_{ij}^{l}$ denotes the fused feature vector at spatial position $(i, j)$, and $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$, $\delta_{ij}^{l}$ are the spatial importance weights, adaptively learned by the model, of the four different levels with respect to level $l$. We let $\alpha_{ij}^{l}, \beta_{ij}^{l}, \gamma_{ij}^{l}, \delta_{ij}^{l} \in [0, 1]$ and $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} + \delta_{ij}^{l} = 1$, as shown in formula 1-6:
$\alpha_{ij}^{l} = \dfrac{e^{\lambda_{\alpha,ij}^{l}}}{e^{\lambda_{\alpha,ij}^{l}} + e^{\lambda_{\beta,ij}^{l}} + e^{\lambda_{\gamma,ij}^{l}} + e^{\lambda_{\delta,ij}^{l}}}$    (formula 1-6)
where $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$, $\delta_{ij}^{l}$ are defined using a softmax function with control parameters $\lambda_{\alpha}$, $\lambda_{\beta}$, $\lambda_{\gamma}$, $\lambda_{\delta}$; the control parameters may be obtained by a 1x1 convolution.
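As an illustration of formulas 1-5 and 1-6 (again a sketch under assumptions, with the class and argument names invented here), the per-position level weights can be produced by 1x1 convolutions and normalised with a softmax:

```python
# Sketch of multi-scale adaptive fusion: softmax-normalised per-pixel weights
# over four pyramid levels that have already been resized to the target level.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAdaptiveFusion(nn.Module):
    def __init__(self, channels: int, num_levels: int = 4):
        super().__init__()
        # one 1x1 convolution per level producing a single-channel control map (lambda)
        self.controls = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_levels)]
        )

    def forward(self, feats):
        """feats: list of tensors x^{n->l}, each (N, C, H, W) at the target level l."""
        lambdas = torch.cat([conv(f) for conv, f in zip(self.controls, feats)], dim=1)
        weights = F.softmax(lambdas, dim=1)       # formula 1-6: weights sum to 1 per position
        fused = sum(weights[:, i:i + 1] * f for i, f in enumerate(feats))
        return fused                               # formula 1-5: weighted sum y^l
```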
And S5, adding appropriate positive and negative clicks to re-mark the erroneous regions according to the initial segmentation mask result.
The model returns the mask result of the first segmentation to the front end for display; according to the segmentation result, positive and negative clicks are added to mark the erroneous parts of the segmentation mask for further modification, mainly near the edges of the segmented object. After the marking is completed, step S3 is repeated to convert the new positive and negative clicks into new positive and negative click guidance.
And S6, feeding the new markers into the segmentation model again and returning the corrected result; the segmentation result is refined in this reciprocating manner until a satisfactory result is obtained.
The newly generated positive and negative click guidance and the segmentation mask returned to the front end are concatenated, added element by element to the original image, and then fed into the model for re-prediction; finally the corrected segmentation mask is returned to the front end. The interactive correction is iterated in this way until a satisfactory segmentation mask is obtained, as shown in fig. 5.
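Putting steps S3-S6 together, the refinement loop can be sketched as below; `model`, `accept_fn` and `add_clicks_fn` are placeholders for the trained segmentation model, the user's acceptance decision, and the user's corrective clicks, and build_model_input refers to the earlier sketch.

```python
# Hypothetical end-to-end refinement loop for the interactive annotation flow.
import numpy as np

def interactive_refinement(model, image_rgb: np.ndarray, clicks,
                           add_clicks_fn, accept_fn, max_rounds: int = 20):
    """Iteratively predict, let the user correct, and re-predict until accepted."""
    mask = None
    for _ in range(max_rounds):
        x = build_model_input(image_rgb, clicks, prev_mask=mask)  # 3 x 448 x 448 input
        mask = model(x)                       # predicted segmentation mask (448 x 448)
        if accept_fn(mask):                   # user is satisfied with the result
            break
        clicks = add_clicks_fn(clicks, mask)  # corrective clicks near erroneous edges
    return mask
```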
The Transformer-based interactive image segmentation method provided by the invention addresses the facts that interactive segmentation models neglect the different contributions of individual clicks to the segmentation result and suffer from inconsistency conflicts caused by multi-scale fusion, and designs a reconstructed click model (Interactive Segmentation with Reconstruct Click Vision Transformers) so that the model can fully exploit the click interaction information and the scale invariance of the features is improved. The method achieves state-of-the-art performance on the interactive image segmentation annotation task and also shows good generalization on medical images. Better segmentation results are obtained with fewer interactions, which gives the method commercial application value.
The above embodiments and drawings are not intended to limit the form or style of the present invention, and any suitable changes or modifications made by those skilled in the art should be regarded as not departing from the scope of the present invention.

Claims (6)

1. The interactive image segmentation method based on the Transformer, characterized by comprising the following steps:
S1, selecting an image to be annotated, and loading it into interactive image segmentation annotation software;
S2, selecting a segmentation target, starting with a left click; after starting, generating a click record according to the click actions and generating a click marker at the corresponding position;
S3, after the interaction is confirmed, converting the click record into disc maps that serve as the corresponding positive and negative click guidance, concatenating them with the previous mask, and finally adding the result to the original image as the segmentation model input;
S4, segmenting the specified target in the image using a pre-trained segmentation model, and returning an initial segmentation mask; a simulated sampling strategy is adopted to generate the positive and negative click guidance when training the segmentation model, the segmentation model being a reconstructed click model with a Transformer as the backbone, and the reconstructed click model mainly comprising a reconstructed click image embedding module and a multi-scale adaptive fusion module for enhancing the learning ability of the model to obtain precise segmentation results;
the reconstructed click image embedding module performs feature separation and reconstruction according to the importance of different clicks so as to enhance the feature representation of the important clicks, and the method specifically comprises the following steps:
first, the contribution of different clicks is evaluated by the scaling factors in the group normalization layer; for a given feature map $X \in \mathbb{R}^{N \times C \times H \times W}$, $N$ is the batch size, $C$ is the number of channels, and $H \times W$ is the image size;
the input feature $X$ is normalized by a simple normalization operation, with the specific formula:
$X_{out} = GN(X) = \gamma \dfrac{X - \mu}{\sqrt{\sigma^{2} + \varepsilon}} + \beta$
where $\mu$ and $\sigma^{2}$ are the mean and variance respectively, $\gamma$ and $\beta$ are trainable parameters, the magnitude of $\gamma$ reflects the degree of contribution of the spatial pixel information, and $\varepsilon$ is a small positive constant that keeps the division well defined; $GN(\cdot)$ denotes the group normalization operation applied to $X$;
the normalized weights $W_{\gamma}$ represent the importance of different feature maps, with the formula:
$W_{\gamma_{i}} = \dfrac{\gamma_{i}}{\sum_{j=1}^{C} \gamma_{j}}, \quad i = 1, 2, \dots, C$
where $W_{\gamma_{i}}$ is the normalized weight of the $i$-th channel, $\gamma_{i}$ is the weight of the $i$-th channel, and $\sum_{j=1}^{C} \gamma_{j}$ denotes summing the weights of all channels;
then, the normalized weights are mapped into the range 0-1 by a Sigmoid function, and the weights are distinguished by a gating mechanism; weights greater than or equal to the threshold are considered to contribute more and are denoted $W_{1}$, weights smaller than the threshold are considered to contribute less and are denoted $W_{2}$, with the formula:
$W_{1} = \mathrm{Gate}_{\geq 0.5}\big(\mathrm{Sigmoid}(W_{\gamma})\big), \qquad W_{2} = \mathrm{Gate}_{< 0.5}\big(\mathrm{Sigmoid}(W_{\gamma})\big)$
where $\mathrm{Gate}$ is the threshold gating applied to the parameter $W_{\gamma}$, and the default threshold value is 0.5; the degree of contribution is distinguished by comparing each weight with the threshold;
then, the input feature $X$ is multiplied by $W_{1}$ and $W_{2}$ to obtain the features with different contribution degrees, $X_{1}$ and $X_{2}$;
finally, the two weighted information features are fused, with the formula:
$X' = X_{1} \oplus X_{2}$
where $\oplus$ denotes the fusion operation;
the multi-scale adaptive fusion module learns how to filter features from other levels so as to adaptively combine and preserve useful spatial information, and the specific method is as follows:
first, the features of different resolutions generated by the pyramid are denoted $x^{n}$, where $n$ denotes the level; let $x_{ij}^{n \to l}$ denote the feature vector at spatial position $(i, j)$ resized from level $n$ to level $l$; the fusion at level $l$ is given by the formula:
$y_{ij}^{l} = \alpha_{ij}^{l}\, x_{ij}^{1 \to l} + \beta_{ij}^{l}\, x_{ij}^{2 \to l} + \gamma_{ij}^{l}\, x_{ij}^{3 \to l} + \delta_{ij}^{l}\, x_{ij}^{4 \to l}$
where $y_{ij}^{l}$ denotes the fused feature vector at spatial position $(i, j)$, and $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$, $\delta_{ij}^{l}$ are the spatial importance weights, adaptively learned by the model, of the four different levels with respect to level $l$, with $\alpha_{ij}^{l}, \beta_{ij}^{l}, \gamma_{ij}^{l}, \delta_{ij}^{l} \in [0, 1]$ and $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} + \delta_{ij}^{l} = 1$, as given by the formula:
$\alpha_{ij}^{l} = \dfrac{e^{\lambda_{\alpha,ij}^{l}}}{e^{\lambda_{\alpha,ij}^{l}} + e^{\lambda_{\beta,ij}^{l}} + e^{\lambda_{\gamma,ij}^{l}} + e^{\lambda_{\delta,ij}^{l}}}$
where $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$, $\delta_{ij}^{l}$ are defined using a softmax function with control parameters $\lambda_{\alpha}$, $\lambda_{\beta}$, $\lambda_{\gamma}$, $\lambda_{\delta}$, and the control parameters are obtained by a 1x1 convolution;
S5, adding appropriate positive and negative clicks to re-mark the erroneous regions according to the initial segmentation mask result;
and S6, feeding the new markers into the segmentation model again and returning the corrected result; the segmentation result is refined in this reciprocating manner until a satisfactory result is obtained.
2. The Transformer-based interactive image segmentation method of claim 1, wherein: in step S1, the image to be annotated may be of any size and is loaded into the interactive image segmentation annotation software, and after the interaction is completed the image is adaptively resized to the uniform size 448 x 448 to meet the size requirement of the segmentation model input.
3. The Transformer-based interactive image segmentation method of claim 1, wherein: in step S2, after the user clicks the start button in the menu bar, mouse events are continuously monitored; when the user presses the left button at a position, the coordinate information of that position is recorded and a green dot is generated at the corresponding position to indicate that the position lies inside the object to be segmented, namely the segmentation target; when the button is released, one marking operation is finished; when the user presses the right button at a position, the coordinate information of that position is recorded and a red dot is generated at the corresponding position to indicate that the position lies outside the object to be segmented, namely the non-target background.
4. The Transformer-based interactive image segmentation method of claim 3, wherein: in step S2, the radius of the green dots and the red dots is 5 pixels; in step S3, the click records generated in step S2 are regarded as inside/outside areas of the segmentation target, and disc maps with a radius of 5 pixels are generated from the click coordinates to serve as the positive and negative click guidance; for the initial segmentation, an image of size 448 x 448 with all pixel values equal to 0 is generated as the initial segmentation mask; the positive and negative click guidance and the initial segmentation mask are concatenated into a three-channel image of size 3 x 448 x 448, and the three-channel original RGB image of size 448 x 448 is added element by element to the 3 x 448 x 448 image formed by the positive and negative clicks and the initial segmentation mask, with the result used as the segmentation model input.
5. The Transformer-based interactive image segmentation method of claim 1, wherein: in step S5, the segmentation model returns the mask result of the first segmentation to the front end for display; according to the segmentation result, further modification can be made by adding positive and negative clicks to mark the erroneous parts of the segmentation mask, mainly near the edges of the segmented object; after the marking is completed, step S3 is repeated to convert the new positive and negative clicks into new positive and negative click guidance.
6. The Transformer-based interactive image segmentation method of claim 5, wherein: in step S6, the newly generated positive and negative click guidance and the segmentation mask returned to the front end are concatenated, then added element by element to the original image, and then fed into the model for re-prediction; finally the corrected segmentation mask is returned to the front end, and the interactive correction is iterated until a satisfactory segmentation mask is obtained.
CN202311667809.5A 2023-12-07 2023-12-07 Interactive image segmentation method based on Transformer Active CN117372701B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311667809.5A CN117372701B (en) 2023-12-07 2023-12-07 Interactive image segmentation method based on Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311667809.5A CN117372701B (en) 2023-12-07 2023-12-07 Interactive image segmentation method based on Transformer

Publications (2)

Publication Number Publication Date
CN117372701A true CN117372701A (en) 2024-01-09
CN117372701B CN117372701B (en) 2024-03-12

Family

ID=89393288

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311667809.5A Active CN117372701B (en) 2023-12-07 2023-12-07 Interactive image segmentation method based on Transformer

Country Status (1)

Country Link
CN (1) CN117372701B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021150017A1 (en) * 2020-01-23 2021-07-29 Samsung Electronics Co., Ltd. Method for interactive segmenting an object on an image and electronic computing device implementing the same
CN115115830A (en) * 2022-05-17 2022-09-27 西北农林科技大学 Improved Transformer-based livestock image instance segmentation method
CN115359254A (en) * 2022-07-25 2022-11-18 华南理工大学 Vision transform network-based weak supervision instance segmentation method, system and medium
CN115482241A (en) * 2022-10-21 2022-12-16 上海师范大学 Cross-modal double-branch complementary fusion image segmentation method and device
CN116071553A (en) * 2023-02-16 2023-05-05 之江实验室 Weak supervision semantic segmentation method and device based on naive VisionTransformer
CN116258976A (en) * 2023-03-24 2023-06-13 长沙理工大学 Hierarchical transducer high-resolution remote sensing image semantic segmentation method and system
US20230368508A1 (en) * 2022-05-12 2023-11-16 Hitachi, Ltd. Area extraction method and area extraction system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021150017A1 (en) * 2020-01-23 2021-07-29 Samsung Electronics Co., Ltd. Method for interactive segmenting an object on an image and electronic computing device implementing the same
US20230368508A1 (en) * 2022-05-12 2023-11-16 Hitachi, Ltd. Area extraction method and area extraction system
CN115115830A (en) * 2022-05-17 2022-09-27 西北农林科技大学 Improved Transformer-based livestock image instance segmentation method
CN115359254A (en) * 2022-07-25 2022-11-18 华南理工大学 Vision transform network-based weak supervision instance segmentation method, system and medium
CN115482241A (en) * 2022-10-21 2022-12-16 上海师范大学 Cross-modal double-branch complementary fusion image segmentation method and device
CN116071553A (en) * 2023-02-16 2023-05-05 之江实验室 Weak supervision semantic segmentation method and device based on naive VisionTransformer
CN116258976A (en) * 2023-03-24 2023-06-13 长沙理工大学 Hierarchical transducer high-resolution remote sensing image semantic segmentation method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李国庆 (LI Guoqing): "Research on Interactive Image Segmentation Methods" (交互式图像分割方法研究), China Doctoral Dissertations Full-text Database, Information Science and Technology Series, no. 2, pages 138-188 *

Also Published As

Publication number Publication date
CN117372701B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
Gokaslan et al. Improving shape deformation in unsupervised image-to-image translation
CN109344701B (en) Kinect-based dynamic gesture recognition method
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN112784736B (en) Character interaction behavior recognition method based on multi-modal feature fusion
WO2023138062A1 (en) Image processing method and apparatus
CN111325750B (en) Medical image segmentation method based on multi-scale fusion U-shaped chain neural network
WO2022042348A1 (en) Medical image annotation method and apparatus, device, and storage medium
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN114782694A (en) Unsupervised anomaly detection method, system, device and storage medium
CN114511502A (en) Gastrointestinal endoscope image polyp detection system based on artificial intelligence, terminal and storage medium
CN113312973A (en) Method and system for extracting features of gesture recognition key points
CN110827304A (en) Traditional Chinese medicine tongue image positioning method and system based on deep convolutional network and level set method
CN111368733B (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN107730568B (en) Coloring method and device based on weight learning
CN116434033A (en) Cross-modal contrast learning method and system for RGB-D image dense prediction task
Zhang et al. Hierarchical attention aggregation with multi-resolution feature learning for GAN-based underwater image enhancement
CN112070181B (en) Image stream-based cooperative detection method and device and storage medium
CN117372701B (en) Interactive image segmentation method based on Transformer
CN112561782A (en) Method for improving reality degree of simulation picture of offshore scene
CN112801238B (en) Image classification method and device, electronic equipment and storage medium
CN116129417A (en) Digital instrument reading detection method based on low-quality image
CN112862840B (en) Image segmentation method, device, equipment and medium
Chen et al. Application of generative adversarial network in image color correction
CN114387489A (en) Power equipment identification method and device and terminal equipment
CN114463346A (en) Complex environment rapid tongue segmentation device based on mobile terminal

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant