CN117372701A - Interactive image segmentation method based on Transformer - Google Patents

Interactive image segmentation method based on Transformer

Info

Publication number
CN117372701A
CN117372701A
Authority
CN
China
Prior art keywords
segmentation
click
image
model
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311667809.5A
Other languages
Chinese (zh)
Other versions
CN117372701B (en)
Inventor
何一凡
陈盼盼
王大寒
江楠峰
吴芸
王驰明
朱顺痣
于金喜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Ruiwei Information Technology Co ltd
Xiamen University of Technology
Original Assignee
Xiamen Ruiwei Information Technology Co ltd
Xiamen University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Ruiwei Information Technology Co ltd, Xiamen University of Technology filed Critical Xiamen Ruiwei Information Technology Co ltd
Priority to CN202311667809.5A priority Critical patent/CN117372701B/en
Publication of CN117372701A publication Critical patent/CN117372701A/en
Application granted granted Critical
Publication of CN117372701B publication Critical patent/CN117372701B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a Transformer-based interactive image segmentation method, which comprises: selecting an image to be annotated and loading it into interactive image segmentation annotation software; selecting a segmentation target, generating a click record from the user's click actions, and generating a click marker at the corresponding position; after the interaction is confirmed, converting the click record into disc maps that serve as the corresponding positive and negative click guidance, concatenating them with the previous mask, and adding the result to the original image as the segmentation model input; segmenting the specified target in the image with a pre-trained segmentation model and returning an initial segmentation mask; adding appropriate positive and negative clicks to re-mark erroneous regions according to the initial segmentation mask result; and feeding the new markers into the segmentation model again and returning the corrected result. The segmentation result is refined in this way until a satisfactory result is obtained. The invention improves interactive image segmentation annotation performance and obtains better segmentation results with fewer interactions.

Description

Interactive image segmentation method based on Transformer
Technical Field
The invention relates to the technical field of computer vision and interactive segmentation annotation, and in particular to a Transformer-based interactive image segmentation method.
Background
In the era of rapid development of large models, deep learning technology has undergone revolutionary progress, mainly benefiting from large-scale, high-performance computing resources and huge data volumes. Large models exhibit excellent capabilities in handling complex tasks and extracting high-level features, and their performance is largely due to the successful application of deep learning methods to large amounts of training data. With the advent of large models, the demand for high-quality training data sets has also grown exponentially in areas such as semantic segmentation, instance segmentation, and salient object detection. However, the annotation of these data sets becomes increasingly cumbersome and expensive, particularly in pixel-level annotation tasks. Interactive image segmentation algorithms therefore show important research and application value, and more and more researchers are exploring this field. Interactive image segmentation algorithms have become the preferred choice for reducing manual annotation cost; their introduction makes annotating training data sets more convenient and efficient, which in turn promotes the rapid development of large models. While pursuing high precision and intelligent operation, these algorithms provide a practical solution to the challenge of large-scale data annotation. Over decades of development, interactive image segmentation technology has accumulated rich theories and methods and formed a relatively complete algorithm system and interaction framework. The main interaction modes include clicks, scribbles, bounding boxes, and outlines. Interactive segmentation methods can be divided into two main categories: traditional methods and deep learning-based methods. Traditional approaches typically exploit low-level features such as pixel values and the relationships between them. Because traditional methods do not train a model, they cannot learn from existing experience, so this type of method requires considerable labor cost to obtain a good segmentation result. With the rapid development of deep learning, convolutional neural networks have replaced traditional methods, and Transformer-based interactive segmentation models in turn exhibit performance superior to convolutional neural networks. Although these algorithms have achieved significant success in interactive image segmentation tasks, existing segmentation methods still have limitations, including interaction modes that are not flexible enough, inaccurate reflection of user intention, insufficient segmentation precision, and the need for many user interactions. Consequently, these interactive image segmentation methods fall short when facing natural images with complex backgrounds and medical images with blurred contours. Based on this, the present invention improves the interactive segmentation model in these respects.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a Transformer-based interactive image segmentation method which improves interactive image segmentation annotation performance, shows good generalization on medical images, and obtains better segmentation results with fewer interactions.
In order to achieve the above object, the solution of the present invention is:
the interactive image segmentation method based on the Transformer comprises the following steps of:
s1, selecting an image to be marked, and loading the image to the interactive image segmentation marking software;
s2, selecting a segmentation target by a user, starting with left click, generating a record according to click behaviors of the user after starting, and generating a click mark at a corresponding position;
s3, after the interaction is confirmed, converting the interaction into a circular graph according to the click record, splicing the circular graph with an original mask as a corresponding positive and negative click guide, and finally adding the circular graph with the original mask as a segmentation model to input;
s4, segmenting an appointed target in the image by utilizing a pre-training segmentation model, and returning to an initial segmentation mask;
s5, selecting to add proper positive and negative clicks to mark the error region again according to the initial segmentation mask result;
and S6, sending the new mark into the segmentation model again, returning a corrected result, and refining the segmentation result to obtain a satisfactory result in a reciprocating manner.
Further, in step S1, the image to be annotated may be of any size; it is loaded into the interactive image segmentation annotation software, and after the interaction is completed the software adaptively resizes the image to the uniform size 448 x 448 to ensure that the size requirement of the segmentation model input is met.
Further, in step S2, after the user clicks the start button in the menu bar, mouse events are continuously monitored; when the user presses the left button at a position, the coordinate information of that position is recorded and a green dot is generated at the corresponding position to indicate that the position lies inside the object to be segmented, i.e. the segmentation target (foreground); when the button is released, one marking operation is finished. Similarly, when the user presses the right button at a position, the same operation is performed, except that a red dot is generated to indicate that the position lies outside the object to be segmented, i.e. the background.
Further, in step S2, the radius of the dots is 5 pixels. When the radius is too large, positive/negative clicks near the foreground edge may be mistaken for negative/positive clicks, introducing ambiguity into the interaction information, so that the model misinterprets them and produces poor segmentation results; when the radius is too small, the area covered by a click is too small to provide rich information, so the user may need to provide more interactions, which runs against the original intent of interactive segmentation.
In step S3, the click records generated in step S2 are regarded as inside/outside areas of the segmentation target, and disc maps with a radius of 5 pixels are generated from the click coordinates to serve as the positive and negative click guidance; for the initial segmentation, an image of size 448 x 448 with all pixel values equal to 0 is generated as the initial segmentation mask, and the positive and negative click guidance and the initial segmentation mask are concatenated into a three-channel image of size 3 x 448 x 448.
Further, the 3-channel original RGB image of size 448 x 448 is added element by element to the 3 x 448 x 448 image formed by the positive and negative clicks and the initial segmentation mask, and the result is used as the segmentation model input.
Further, in step S4, when the segmentation model is trained, the deep learning model needs a large amount of annotated data, easily numbering in the millions; if manual click annotation were adopted, the cost would be too high, so a simulated sampling strategy is used to generate the positive and negative click guidance.
In step S4, the segmentation model is a reconstructed click model (Interactive Segmentation with Reconstruct Click Vision Transformers) using a Transformer as the backbone; the reconstructed click model mainly comprises a reconstructed click image embedding module and a multi-scale adaptive fusion module, which are used to enhance the learning ability of the model and obtain precise segmentation results.
Further, the reconstructed click image embedding module performs feature separation and reconstruction according to the importance of different clicks so as to enhance the feature representation of the important clicks; the specific method is as follows:
first, the contribution of different clicks is evaluated using the scaling factors of the group normalization (Group Normalization) layer; for a given feature map $X \in \mathbb{R}^{N \times C \times H \times W}$, $N$ is the batch size, $C$ is the number of channels, and $H \times W$ is the image size;
the input feature $X$ is first normalized by a simple normalization operation, as shown in formula 1-1:
$X_{out} = GN(X) = \gamma \dfrac{X - \mu}{\sqrt{\sigma^{2} + \varepsilon}} + \beta$    (formula 1-1)
where $\mu$ and $\sigma^{2}$ are the mean and variance respectively, $\gamma$ and $\beta$ are trainable parameters, the magnitude of $\gamma$ reflects the degree of contribution of the spatial pixel information, and $\varepsilon$ is a small positive constant that keeps the division well defined; $GN(\cdot)$ denotes the group normalization operation applied to $X$;
the normalized weights $W_{\gamma}$ represent the importance of the different feature maps, as shown in formula 1-2:
$W_{\gamma_{i}} = \dfrac{\gamma_{i}}{\sum_{j=1}^{C} \gamma_{j}}, \quad i = 1, 2, \dots, C$    (formula 1-2)
where $W_{\gamma_{i}}$ is the normalized weight of the $i$-th channel, $\gamma_{i}$ is the weight of the $i$-th channel, and $\sum_{j=1}^{C} \gamma_{j}$ sums the weights of all channels;
the normalized weights are then mapped into the range 0-1 by a Sigmoid function, and the weights are separated by a gating mechanism (with the threshold set to 0.5): weights greater than or equal to the threshold (0.5) are considered to contribute more and are denoted $W_{1}$, while weights smaller than the threshold (0.5) are considered to contribute less and are denoted $W_{2}$, as shown in formula 1-3:
$W_{1} = \mathrm{Gate}_{\geq 0.5}\big(\mathrm{Sigmoid}(W_{\gamma})\big), \qquad W_{2} = \mathrm{Gate}_{< 0.5}\big(\mathrm{Sigmoid}(W_{\gamma})\big)$    (formula 1-3)
where $\mathrm{Gate}$ is the threshold gating applied to the parameter $W_{\gamma}$, with the default threshold value 0.5; the degree of contribution is distinguished by comparing each weight with the threshold;
then, the input feature $X$ is multiplied by $W_{1}$ and $W_{2}$ to obtain the features with different contribution degrees, $X_{1}$ and $X_{2}$;
finally, the two weighted information features are fused, as shown in formula 1-4:
$X' = X_{1} \oplus X_{2}$    (formula 1-4)
where $\oplus$ denotes the fusion operation.
further, the multi-scale adaptive fusion module learns how to filter the feature space at other levels so as to preserve useful spatial information adaptive combination, and the specific method is as follows:
first, features of different resolutions generated by pyramids are expressed asWherein->Express hierarchy, let->The representation is located at the +.>Feature vector at from level->Resizing to level->In the hierarchy->The fusion pattern of (2) is shown in formulas 1-5:
wherein the method comprises the steps ofRepresenting post-fusion spatial position->Feature vector at>、/>Four different levels to level for model adaptive learning>Spatial importance weight of>And is also provided withAs shown in equations 1-6:
further, in step S5, the segmentation model returns the mask result of the first segmentation to the front end for display, and may be further modified according to the segmentation result, and the positive and negative clicks are added to the wrong place of the segmentation mask for marking, where the vicinity of the edge of the segmented object is marked, and after the marking is completed, the step S3 is repeated to convert the new positive and negative clicks again to generate a positive and negative click guidance.
Further, in step S6, the newly generated positive and negative click guidance and the segmentation mask that was just returned to the front end are subjected to a stitching operation, then added element by element with the original image, and then sent to the model for re-prediction, and finally the corrected segmentation mask is returned to the front end, so that interactive correction is iterated until a satisfactory segmentation mask is obtained.
With the above scheme, the Transformer-based interactive image segmentation method provided by the invention addresses the facts that interactive segmentation models neglect the different contributions of individual clicks to the segmentation result and suffer from inconsistency conflicts caused by multi-scale fusion, and designs a reconstructed click model (Interactive Segmentation with Reconstruct Click Vision Transformers) so that the model can fully exploit the click interaction information and the scale invariance of the features is improved. The method achieves state-of-the-art performance on the interactive image segmentation annotation task and also shows good generalization on medical images. Better segmentation results are obtained with fewer interactions, which gives the method commercial application value.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention designs a brand-new Transformer-based interactive segmentation model.
(2) The invention effectively enables the segmentation model to make full use of the click interaction information and strengthens the influence of the interaction guidance on the segmentation result.
(3) The invention effectively resolves the inconsistency caused by the fusion of multi-scale features and improves the scale invariance of the features.
(4) The Transformer-based interactive image segmentation method provided by the invention shows good generalization performance on medical data sets.
Drawings
Fig. 1 is a flow chart of the interactive image segmentation method based on the Transformer.
Fig. 2 is a model diagram of the Transformer-based interactive image segmentation method of the present invention.
FIG. 3 is a schematic diagram of the present invention regarding the embedding of reconstructed click images.
FIG. 4 is a schematic diagram of the architecture of the present invention with respect to multi-scale adaptive fusion.
Fig. 5 is a schematic diagram of a segmentation result of the Transformer-based interactive image segmentation method of the present invention.
Detailed Description
In order to further explain the technical scheme of the invention, the invention is explained in detail by specific examples.
As shown in fig. 1 to 5, the present invention provides a Transformer-based interactive image segmentation method with which a user interactively annotates segmentation images, comprising the following steps:
and S1, selecting an image to be marked, and loading the image to the interactive image segmentation marking software. The user can select any size of the picture, the picture is loaded into the interactive image segmentation software, and after the interaction is completed, the picture is adaptively adjusted to the uniform size 448 x 448, so that the size requirement of model input is met.
And S2, selecting a segmentation target by a user, starting with left click, generating a record according to the click action of the user after starting, and generating a click mark at a corresponding position.
After the user clicks the start button in the menu bar on the right side, mouse events are continuously monitored. When the user presses the left button at a position, the coordinate information of that position is recorded and a green dot is generated at the corresponding position to indicate that the position lies inside the object to be segmented, i.e. the segmentation target (foreground). When the button is released, one marking operation is finished. Likewise, when the user presses the right button at a position, the same operation is performed, except that a red dot is generated to indicate that the position lies outside the object to be segmented, i.e. the background.
According to experiments and statistics, the radius of the dots is set to 5 pixels. When the radius is too large, positive/negative clicks near the foreground edge can be mistaken for negative/positive clicks, introducing ambiguity into the interaction information, causing the model to misinterpret them and produce poor segmentation results. When the radius is too small, the area covered by a click is too small to provide rich information, so the user may need to provide more interactions, which runs against the original intent of interactive segmentation.
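As a purely illustrative aid (not part of the patented software), the following Python sketch shows one way the click recording described above could be structured; the names ClickRecord and on_mouse_press and the event-callback interface are assumptions.

```python
# Hypothetical sketch of click recording for the annotation front end.
# ClickRecord / on_mouse_press are illustrative names, not the patent's API.
from dataclasses import dataclass
from typing import List

@dataclass
class ClickRecord:
    x: int          # column coordinate of the click in the displayed image
    y: int          # row coordinate of the click in the displayed image
    positive: bool  # True = left click (foreground), False = right click (background)

clicks: List[ClickRecord] = []

def on_mouse_press(x: int, y: int, button: str) -> None:
    """Record one click and its polarity; the UI would draw a green dot for
    positive clicks and a red dot for negative clicks at (x, y)."""
    clicks.append(ClickRecord(x=x, y=y, positive=(button == "left")))
```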
And S3, after the interaction is confirmed, the click records are converted into disc maps that serve as the corresponding positive and negative click guidance, the disc maps are concatenated with the previous mask, and the result is finally added to the original image as the segmentation model input.
The previous click records are regarded as inside/outside areas of the segmentation target, and disc maps with a radius of 5 pixels are generated from the click coordinates to serve as the positive and negative click guidance, respectively. For the initial segmentation, an image of size 448 x 448 with all pixel values equal to 0 is generated as the initial segmentation mask. The positive and negative click guidance and the initial segmentation mask are concatenated into a three-channel image of size 3 x 448 x 448.
The 3-channel original RGB image of size 448 x 448 is added element by element to the 3 x 448 x 448 image formed by the positive and negative clicks and the initial segmentation mask, and the result is used as the segmentation model input.
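To make the input construction concrete, here is a minimal sketch that builds on the ClickRecord structure from the earlier sketch; it assumes a channel-first 3 x 448 x 448 RGB array and 5-pixel click disks, and the function names are assumptions rather than the patent's implementation.

```python
# Illustrative sketch: rasterise clicks as disks, stack them with the previous
# mask into a 3 x 448 x 448 guidance tensor, and add it to the RGB image.
import numpy as np

def render_click_map(clicks, positive: bool, size: int = 448, radius: int = 5) -> np.ndarray:
    """Draw the positive or negative clicks as filled disks of the given radius."""
    canvas = np.zeros((size, size), dtype=np.float32)
    yy, xx = np.mgrid[0:size, 0:size]
    for c in clicks:
        if c.positive == positive:
            canvas[(yy - c.y) ** 2 + (xx - c.x) ** 2 <= radius ** 2] = 1.0
    return canvas

def build_model_input(image_rgb: np.ndarray, clicks, prev_mask=None) -> np.ndarray:
    """image_rgb: float array of shape (3, 448, 448); returns the model input."""
    size = 448
    if prev_mask is None:                      # initial segmentation: all-zero mask
        prev_mask = np.zeros((size, size), dtype=np.float32)
    guidance = np.stack([
        render_click_map(clicks, positive=True, size=size),   # positive click disks
        render_click_map(clicks, positive=False, size=size),  # negative click disks
        prev_mask.astype(np.float32),                         # previous segmentation mask
    ])                                          # shape: (3, 448, 448)
    return image_rgb.astype(np.float32) + guidance             # element-wise addition
```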
And S4, segmenting the specified target in the image using the pre-trained segmentation model, and returning the initial segmentation mask.
When the segmentation model is trained, the deep learning model needs a large amount of annotated data, easily numbering in the millions; if manual click annotation were adopted, the cost would be too high. Therefore, a simulated sampling strategy is employed to generate the positive and negative click guidance.
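The patent does not spell out the simulation procedure here; as an assumption-labelled sketch, one common way to simulate click guidance during training is to draw positive clicks inside the ground-truth mask and negative clicks from the background, for example:

```python
# Hypothetical click simulation for training (the exact sampling strategy of
# the patent is not specified here); positions are sampled from the GT mask.
import numpy as np

def _sample(indices: np.ndarray, n: int, rng) -> list:
    if len(indices) == 0 or n == 0:
        return []
    chosen = rng.choice(len(indices), size=min(n, len(indices)), replace=False)
    return indices[chosen].tolist()

def sample_clicks(gt_mask: np.ndarray, n_pos: int = 3, n_neg: int = 3, seed: int = 0):
    """Return (positive, negative) lists of (row, col) simulated click positions."""
    rng = np.random.default_rng(seed)
    pos = _sample(np.argwhere(gt_mask > 0), n_pos, rng)   # inside the object
    neg = _sample(np.argwhere(gt_mask == 0), n_neg, rng)  # background
    return pos, neg
```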
We designed a reconstructed click model (Interactive Segmentation with Reconstruct Click Vision Transformers) with a Transformer backbone, as shown in fig. 2. The model mainly comprises a reconstructed click image embedding module and a multi-scale adaptive fusion module, which are used to enhance the learning ability of the model and obtain precise segmentation results.
The reconstructed click image embedding module performs feature separation and reconstruction according to the importance of different clicks to enhance the feature representation of important clicks, as shown in fig. 3. First, we evaluate the contribution of different clicks using the scaling factors of the group normalization (Group Normalization) layer. For a given feature map $X \in \mathbb{R}^{N \times C \times H \times W}$, where $N$ is the batch size, $C$ is the number of channels, and $H \times W$ is the image size, we first normalize the input feature $X$, as shown in formula 1-1:
$X_{out} = GN(X) = \gamma \dfrac{X - \mu}{\sqrt{\sigma^{2} + \varepsilon}} + \beta$    (formula 1-1)
where $\mu$ and $\sigma^{2}$ are the mean and variance respectively, $\gamma$ and $\beta$ are trainable parameters, the magnitude of $\gamma$ reflects the degree of contribution of the spatial pixel information, and $\varepsilon$ is a small positive constant that keeps the division well defined; $GN(\cdot)$ denotes the group normalization operation applied to $X$.
The normalized weights $W_{\gamma}$ represent the importance of the different feature maps, as shown in formula 1-2:
$W_{\gamma_{i}} = \dfrac{\gamma_{i}}{\sum_{j=1}^{C} \gamma_{j}}, \quad i = 1, 2, \dots, C$    (formula 1-2)
where $W_{\gamma_{i}}$ is the normalized weight of the $i$-th channel, $\gamma_{i}$ is the weight of the $i$-th channel, and $\sum_{j=1}^{C} \gamma_{j}$ sums the weights of all channels.
The normalized weights are then mapped into the range 0-1 by a Sigmoid function, and the weights are separated by a gating mechanism (threshold set to 0.5): weights greater than or equal to 0.5 are considered to contribute more and are denoted $W_{1}$, while weights smaller than 0.5 are considered to contribute less and are denoted $W_{2}$, as shown in formula 1-3:
$W_{1} = \mathrm{Gate}_{\geq 0.5}\big(\mathrm{Sigmoid}(W_{\gamma})\big), \qquad W_{2} = \mathrm{Gate}_{< 0.5}\big(\mathrm{Sigmoid}(W_{\gamma})\big)$    (formula 1-3)
where $\mathrm{Gate}$ is the threshold gating applied to the parameter $W_{\gamma}$, with the default threshold value 0.5; the degree of contribution is distinguished by comparing each weight with the threshold.
Finally, we multiply the input feature $X$ by $W_{1}$ and $W_{2}$ to obtain the features with different contribution degrees, $X_{1}$ and $X_{2}$, and fuse the two weighted information features, as shown in formula 1-4:
$X' = X_{1} \oplus X_{2}$    (formula 1-4)
where $\oplus$ denotes the fusion operation.
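The following PyTorch sketch is one interpretation of formulas 1-1 to 1-4, offered only as an illustration: the class name, the choice of 16 groups, and the 1x1-convolution fusion of $X_{1}$ and $X_{2}$ (the fusion operator of formula 1-4 is not fixed by the text) are assumptions, and applying the gated weights to the normalized feature rather than the raw input is likewise an interpretation.

```python
# Sketch of the reconstructed click image embedding idea (assumptions noted above).
import torch
import torch.nn as nn

class ReconstructClickEmbedding(nn.Module):
    def __init__(self, channels: int, groups: int = 16, threshold: float = 0.5):
        super().__init__()                      # channels must be divisible by groups
        self.gn = nn.GroupNorm(groups, channels)
        self.threshold = threshold
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)  # placeholder fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (N, C, H, W)
        x_norm = self.gn(x)                                 # formula 1-1: group normalization
        gamma = self.gn.weight                              # trainable scaling factors
        w = gamma / gamma.sum()                             # formula 1-2: channel importance
        w = torch.sigmoid(w).view(1, -1, 1, 1)              # squash weights into (0, 1)
        w1 = (w >= self.threshold).float()                  # formula 1-3: high-contribution gate
        w2 = 1.0 - w1                                       # low-contribution gate
        x1 = x_norm * w1                                    # informative features
        x2 = x_norm * w2                                    # less informative features
        return self.fuse(torch.cat([x1, x2], dim=1))        # formula 1-4: fusion (placeholder)
```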
The multi-scale adaptive fusion module learns how to filter features from other levels so as to adaptively combine and preserve useful spatial information, as shown in fig. 4. First, we denote the features of different resolutions generated by the pyramid as $x^{n}$, where $n$ denotes the level. Let $x_{ij}^{n \to l}$ denote the feature vector at spatial position $(i, j)$ resized from level $n$ to level $l$. The fusion at level $l$ is shown in formula 1-5:
$y_{ij}^{l} = \alpha_{ij}^{l}\, x_{ij}^{1 \to l} + \beta_{ij}^{l}\, x_{ij}^{2 \to l} + \gamma_{ij}^{l}\, x_{ij}^{3 \to l} + \delta_{ij}^{l}\, x_{ij}^{4 \to l}$    (formula 1-5)
where $y_{ij}^{l}$ denotes the fused feature vector at spatial position $(i, j)$, and $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$, $\delta_{ij}^{l}$ are the spatial importance weights, adaptively learned by the model, of the four different levels with respect to level $l$. We let $\alpha_{ij}^{l}, \beta_{ij}^{l}, \gamma_{ij}^{l}, \delta_{ij}^{l} \in [0, 1]$ and $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} + \delta_{ij}^{l} = 1$, as shown in formula 1-6:
$\alpha_{ij}^{l} = \dfrac{e^{\lambda_{\alpha,ij}^{l}}}{e^{\lambda_{\alpha,ij}^{l}} + e^{\lambda_{\beta,ij}^{l}} + e^{\lambda_{\gamma,ij}^{l}} + e^{\lambda_{\delta,ij}^{l}}}$    (formula 1-6)
where $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$, $\delta_{ij}^{l}$ are defined using a softmax function with control parameters $\lambda_{\alpha}$, $\lambda_{\beta}$, $\lambda_{\gamma}$, $\lambda_{\delta}$; the control parameters may be obtained by a 1x1 convolution.
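As an illustration of formulas 1-5 and 1-6 (again a sketch under assumptions, with the class and argument names invented here), the per-position level weights can be produced by 1x1 convolutions and normalised with a softmax:

```python
# Sketch of multi-scale adaptive fusion: softmax-normalised per-pixel weights
# over four pyramid levels that have already been resized to the target level.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAdaptiveFusion(nn.Module):
    def __init__(self, channels: int, num_levels: int = 4):
        super().__init__()
        # one 1x1 convolution per level producing a single-channel control map (lambda)
        self.controls = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_levels)]
        )

    def forward(self, feats):
        """feats: list of tensors x^{n->l}, each (N, C, H, W) at the target level l."""
        lambdas = torch.cat([conv(f) for conv, f in zip(self.controls, feats)], dim=1)
        weights = F.softmax(lambdas, dim=1)       # formula 1-6: weights sum to 1 per position
        fused = sum(weights[:, i:i + 1] * f for i, f in enumerate(feats))
        return fused                               # formula 1-5: weighted sum y^l
```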
And S5, adding appropriate positive and negative clicks to re-mark the erroneous regions according to the initial segmentation mask result.
The model returns the mask result of the first segmentation to the front end for display; according to the segmentation result, positive and negative clicks are added to mark the erroneous parts of the segmentation mask for further modification, mainly near the edges of the segmented object. After the marking is completed, step S3 is repeated to convert the new positive and negative clicks into new positive and negative click guidance.
And S6, feeding the new markers into the segmentation model again and returning the corrected result; the segmentation result is refined in this reciprocating manner until a satisfactory result is obtained.
The newly generated positive and negative click guidance and the segmentation mask returned to the front end are concatenated, added element by element to the original image, and then fed into the model for re-prediction; finally the corrected segmentation mask is returned to the front end. The interactive correction is iterated in this way until a satisfactory segmentation mask is obtained, as shown in fig. 5.
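Putting steps S3-S6 together, the refinement loop can be sketched as below; `model`, `accept_fn` and `add_clicks_fn` are placeholders for the trained segmentation model, the user's acceptance decision, and the user's corrective clicks, and build_model_input refers to the earlier sketch.

```python
# Hypothetical end-to-end refinement loop for the interactive annotation flow.
import numpy as np

def interactive_refinement(model, image_rgb: np.ndarray, clicks,
                           add_clicks_fn, accept_fn, max_rounds: int = 20):
    """Iteratively predict, let the user correct, and re-predict until accepted."""
    mask = None
    for _ in range(max_rounds):
        x = build_model_input(image_rgb, clicks, prev_mask=mask)  # 3 x 448 x 448 input
        mask = model(x)                       # predicted segmentation mask (448 x 448)
        if accept_fn(mask):                   # user is satisfied with the result
            break
        clicks = add_clicks_fn(clicks, mask)  # corrective clicks near erroneous edges
    return mask
```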
The Transformer-based interactive image segmentation method provided by the invention addresses the facts that interactive segmentation models neglect the different contributions of individual clicks to the segmentation result and suffer from inconsistency conflicts caused by multi-scale fusion, and designs a reconstructed click model (Interactive Segmentation with Reconstruct Click Vision Transformers) so that the model can fully exploit the click interaction information and the scale invariance of the features is improved. The method achieves state-of-the-art performance on the interactive image segmentation annotation task and also shows good generalization on medical images. Better segmentation results are obtained with fewer interactions, which gives the method commercial application value.
The above embodiments and drawings are not intended to limit the form or style of the present invention, and any suitable changes or modifications made by those skilled in the art should be regarded as not departing from the scope of the present invention.

Claims (6)

1. The interactive image segmentation method based on the Transformer, characterized by comprising the following steps:
S1, selecting an image to be annotated, and loading it into interactive image segmentation annotation software;
S2, selecting a segmentation target, starting with a left click; after starting, generating a click record according to the click actions and generating a click marker at the corresponding position;
S3, after the interaction is confirmed, converting the click record into disc maps that serve as the corresponding positive and negative click guidance, concatenating them with the previous mask, and finally adding the result to the original image as the segmentation model input;
S4, segmenting the specified target in the image using a pre-trained segmentation model, and returning an initial segmentation mask; a simulated sampling strategy is adopted to generate the positive and negative click guidance when training the segmentation model, the segmentation model being a reconstructed click model with a Transformer as the backbone, and the reconstructed click model mainly comprising a reconstructed click image embedding module and a multi-scale adaptive fusion module for enhancing the learning ability of the model to obtain precise segmentation results;
the reconstructed click image embedding module performs feature separation and reconstruction according to the importance of different clicks so as to enhance the feature representation of the important clicks, and the method specifically comprises the following steps:
first, the contribution of different clicks is evaluated by the scaling factors in the group normalization layer; for a given feature map $X \in \mathbb{R}^{N \times C \times H \times W}$, $N$ is the batch size, $C$ is the number of channels, and $H \times W$ is the image size;
the input feature $X$ is normalized by a simple normalization operation, with the specific formula:
$X_{out} = GN(X) = \gamma \dfrac{X - \mu}{\sqrt{\sigma^{2} + \varepsilon}} + \beta$
where $\mu$ and $\sigma^{2}$ are the mean and variance respectively, $\gamma$ and $\beta$ are trainable parameters, the magnitude of $\gamma$ reflects the degree of contribution of the spatial pixel information, and $\varepsilon$ is a small positive constant that keeps the division well defined; $GN(\cdot)$ denotes the group normalization operation applied to $X$;
the normalized weights $W_{\gamma}$ represent the importance of different feature maps, with the formula:
$W_{\gamma_{i}} = \dfrac{\gamma_{i}}{\sum_{j=1}^{C} \gamma_{j}}, \quad i = 1, 2, \dots, C$
where $W_{\gamma_{i}}$ is the normalized weight of the $i$-th channel, $\gamma_{i}$ is the weight of the $i$-th channel, and $\sum_{j=1}^{C} \gamma_{j}$ denotes summing the weights of all channels;
then, the normalized weights are mapped into the range 0-1 by a Sigmoid function, and the weights are distinguished by a gating mechanism; weights greater than or equal to the threshold are considered to contribute more and are denoted $W_{1}$, weights smaller than the threshold are considered to contribute less and are denoted $W_{2}$, with the formula:
$W_{1} = \mathrm{Gate}_{\geq 0.5}\big(\mathrm{Sigmoid}(W_{\gamma})\big), \qquad W_{2} = \mathrm{Gate}_{< 0.5}\big(\mathrm{Sigmoid}(W_{\gamma})\big)$
where $\mathrm{Gate}$ is the threshold gating applied to the parameter $W_{\gamma}$, and the default threshold value is 0.5; the degree of contribution is distinguished by comparing each weight with the threshold;
then, the input feature $X$ is multiplied by $W_{1}$ and $W_{2}$ to obtain the features with different contribution degrees, $X_{1}$ and $X_{2}$;
finally, the two weighted information features are fused, with the formula:
$X' = X_{1} \oplus X_{2}$
where $\oplus$ denotes the fusion operation;
the multi-scale adaptive fusion module learns how to filter features from other levels so as to adaptively combine and preserve useful spatial information, and the specific method is as follows:
first, the features of different resolutions generated by the pyramid are denoted $x^{n}$, where $n$ denotes the level; let $x_{ij}^{n \to l}$ denote the feature vector at spatial position $(i, j)$ resized from level $n$ to level $l$; the fusion at level $l$ is given by the formula:
$y_{ij}^{l} = \alpha_{ij}^{l}\, x_{ij}^{1 \to l} + \beta_{ij}^{l}\, x_{ij}^{2 \to l} + \gamma_{ij}^{l}\, x_{ij}^{3 \to l} + \delta_{ij}^{l}\, x_{ij}^{4 \to l}$
where $y_{ij}^{l}$ denotes the fused feature vector at spatial position $(i, j)$, and $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$, $\delta_{ij}^{l}$ are the spatial importance weights, adaptively learned by the model, of the four different levels with respect to level $l$, with $\alpha_{ij}^{l}, \beta_{ij}^{l}, \gamma_{ij}^{l}, \delta_{ij}^{l} \in [0, 1]$ and $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} + \delta_{ij}^{l} = 1$, as given by the formula:
$\alpha_{ij}^{l} = \dfrac{e^{\lambda_{\alpha,ij}^{l}}}{e^{\lambda_{\alpha,ij}^{l}} + e^{\lambda_{\beta,ij}^{l}} + e^{\lambda_{\gamma,ij}^{l}} + e^{\lambda_{\delta,ij}^{l}}}$
where $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$, $\delta_{ij}^{l}$ are defined using a softmax function with control parameters $\lambda_{\alpha}$, $\lambda_{\beta}$, $\lambda_{\gamma}$, $\lambda_{\delta}$, and the control parameters are obtained by a 1x1 convolution;
S5, adding appropriate positive and negative clicks to re-mark the erroneous regions according to the initial segmentation mask result;
and S6, feeding the new markers into the segmentation model again and returning the corrected result; the segmentation result is refined in this reciprocating manner until a satisfactory result is obtained.
2. The Transformer-based interactive image segmentation method of claim 1, wherein: in step S1, the image to be annotated may be of any size and is loaded into the interactive image segmentation annotation software, and after the interaction is completed the image is adaptively resized to the uniform size 448 x 448 to meet the size requirement of the segmentation model input.
3. The Transformer-based interactive image segmentation method of claim 1, wherein: in step S2, after the user clicks the start button in the menu bar, mouse events are continuously monitored; when the user presses the left button at a position, the coordinate information of that position is recorded and a green dot is generated at the corresponding position to indicate that the position lies inside the object to be segmented, namely the segmentation target; when the button is released, one marking operation is finished; when the user presses the right button at a position, the coordinate information of that position is recorded and a red dot is generated at the corresponding position to indicate that the position lies outside the object to be segmented, namely the non-target background.
4. The Transformer-based interactive image segmentation method of claim 3, wherein: in step S2, the radius of the green dots and the red dots is 5 pixels; in step S3, the click records generated in step S2 are regarded as inside/outside areas of the segmentation target, and disc maps with a radius of 5 pixels are generated from the click coordinates to serve as the positive and negative click guidance; for the initial segmentation, an image of size 448 x 448 with all pixel values equal to 0 is generated as the initial segmentation mask; the positive and negative click guidance and the initial segmentation mask are concatenated into a three-channel image of size 3 x 448 x 448, and the three-channel original RGB image of size 448 x 448 is added element by element to the 3 x 448 x 448 image formed by the positive and negative clicks and the initial segmentation mask, with the result used as the segmentation model input.
5. The Transformer-based interactive image segmentation method of claim 1, wherein: in step S5, the segmentation model returns the mask result of the first segmentation to the front end for display; according to the segmentation result, further modification can be made by adding positive and negative clicks to mark the erroneous parts of the segmentation mask, mainly near the edges of the segmented object; after the marking is completed, step S3 is repeated to convert the new positive and negative clicks into new positive and negative click guidance.
6. The Transformer-based interactive image segmentation method of claim 5, wherein: in step S6, the newly generated positive and negative click guidance and the segmentation mask returned to the front end are concatenated, then added element by element to the original image, and then fed into the model for re-prediction; finally the corrected segmentation mask is returned to the front end, and the interactive correction is iterated until a satisfactory segmentation mask is obtained.
CN202311667809.5A 2023-12-07 2023-12-07 Interactive image segmentation method based on Transformer Active CN117372701B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311667809.5A CN117372701B (en) 2023-12-07 2023-12-07 Interactive image segmentation method based on Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311667809.5A CN117372701B (en) 2023-12-07 2023-12-07 Interactive image segmentation method based on Transformer

Publications (2)

Publication Number Publication Date
CN117372701A true CN117372701A (en) 2024-01-09
CN117372701B CN117372701B (en) 2024-03-12

Family

ID=89393288

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311667809.5A Active CN117372701B (en) 2023-12-07 2023-12-07 Interactive image segmentation method based on Transformer

Country Status (1)

Country Link
CN (1) CN117372701B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021150017A1 (en) * 2020-01-23 2021-07-29 Samsung Electronics Co., Ltd. Method for interactive segmenting an object on an image and electronic computing device implementing the same
CN115115830A (en) * 2022-05-17 2022-09-27 西北农林科技大学 Improved Transformer-based livestock image instance segmentation method
CN115359254A (en) * 2022-07-25 2022-11-18 华南理工大学 Vision transform network-based weak supervision instance segmentation method, system and medium
CN115482241A (en) * 2022-10-21 2022-12-16 上海师范大学 Cross-modal double-branch complementary fusion image segmentation method and device
CN116071553A (en) * 2023-02-16 2023-05-05 之江实验室 Weak supervision semantic segmentation method and device based on naive VisionTransformer
CN116258976A (en) * 2023-03-24 2023-06-13 长沙理工大学 Hierarchical transducer high-resolution remote sensing image semantic segmentation method and system
US20230368508A1 (en) * 2022-05-12 2023-11-16 Hitachi, Ltd. Area extraction method and area extraction system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021150017A1 (en) * 2020-01-23 2021-07-29 Samsung Electronics Co., Ltd. Method for interactive segmenting an object on an image and electronic computing device implementing the same
US20230368508A1 (en) * 2022-05-12 2023-11-16 Hitachi, Ltd. Area extraction method and area extraction system
CN115115830A (en) * 2022-05-17 2022-09-27 西北农林科技大学 Improved Transformer-based livestock image instance segmentation method
CN115359254A (en) * 2022-07-25 2022-11-18 华南理工大学 Vision transform network-based weak supervision instance segmentation method, system and medium
CN115482241A (en) * 2022-10-21 2022-12-16 上海师范大学 Cross-modal double-branch complementary fusion image segmentation method and device
CN116071553A (en) * 2023-02-16 2023-05-05 之江实验室 Weak supervision semantic segmentation method and device based on naive VisionTransformer
CN116258976A (en) * 2023-03-24 2023-06-13 长沙理工大学 Hierarchical transducer high-resolution remote sensing image semantic segmentation method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李国庆 (LI Guoqing): "Research on Interactive Image Segmentation Methods" (交互式图像分割方法研究), China Doctoral Dissertations Full-text Database, Information Science and Technology Series, no. 2, pages 138-188 *

Also Published As

Publication number Publication date
CN117372701B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
Gokaslan et al. Improving shape deformation in unsupervised image-to-image translation
CN109344701B (en) Kinect-based dynamic gesture recognition method
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN112784736B (en) Character interaction behavior recognition method based on multi-modal feature fusion
WO2023138062A1 (en) Image processing method and apparatus
CN111325750B (en) Medical image segmentation method based on multi-scale fusion U-shaped chain neural network
WO2022042348A1 (en) Medical image annotation method and apparatus, device, and storage medium
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN114782694A (en) Unsupervised anomaly detection method, system, device and storage medium
CN114511502A (en) Gastrointestinal endoscope image polyp detection system based on artificial intelligence, terminal and storage medium
CN113312973A (en) Method and system for extracting features of gesture recognition key points
CN110827304A (en) Traditional Chinese medicine tongue image positioning method and system based on deep convolutional network and level set method
CN111368733B (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN107730568B (en) Coloring method and device based on weight learning
CN116434033A (en) Cross-modal contrast learning method and system for RGB-D image dense prediction task
Zhang et al. Hierarchical attention aggregation with multi-resolution feature learning for GAN-based underwater image enhancement
CN112070181B (en) Image stream-based cooperative detection method and device and storage medium
CN117372701B (en) Interactive image segmentation method based on Transformer
CN112561782A (en) Method for improving reality degree of simulation picture of offshore scene
CN112801238B (en) Image classification method and device, electronic equipment and storage medium
CN116129417A (en) Digital instrument reading detection method based on low-quality image
CN112862840B (en) Image segmentation method, device, equipment and medium
Chen et al. Application of generative adversarial network in image color correction
CN114387489A (en) Power equipment identification method and device and terminal equipment
CN114463346A (en) Complex environment rapid tongue segmentation device based on mobile terminal

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant