CN117911430A - Method and device for segmenting interactive microorganism image based on transformer - Google Patents


Info

Publication number
CN117911430A
CN117911430A (application CN202410308761.7A)
Authority
CN
China
Prior art keywords
image
prompt
original image
mask
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410308761.7A
Other languages
Chinese (zh)
Inventor
奕巧莲
徐英春
谢秀丽
李柏蕤
连荷清
方喆君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaofei Technology Co ltd
Peking Union Medical College Hospital Chinese Academy of Medical Sciences
Original Assignee
Beijing Xiaofei Technology Co ltd
Peking Union Medical College Hospital Chinese Academy of Medical Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaofei Technology Co ltd, Peking Union Medical College Hospital Chinese Academy of Medical Sciences filed Critical Beijing Xiaofei Technology Co ltd
Priority to CN202410308761.7A priority Critical patent/CN117911430A/en
Publication of CN117911430A publication Critical patent/CN117911430A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a transformer-based method and device for interactive microorganism image segmentation. The method comprises the following steps: in response to a prompt instruction, acquiring an original image to be segmented, wherein the original image carries a prompt mark corresponding to the prompt instruction; and inputting the original image and the prompt mark into a pre-trained image segmentation model to obtain the segmentation result output by the image segmentation model. The network structure of the image segmentation model comprises: an encoder, configured to extract multi-dimensional features from the input original image and, according to the type of the prompt mark, fuse the image features with the prompt-mark features to obtain fusion features, wherein the types of the prompt mark comprise sparse and dense; and a decoder, configured to decode the fusion features to obtain a segmentation mask of the target region. The accuracy and operational flexibility of microbial image segmentation results are thereby improved.

Description

Method and device for segmenting interactive microorganism image based on transformer
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a transformer-based method and device for interactive microorganism image segmentation.
Background
Practical medical application scenarios involve segmenting images such as CT images and images under a microscope. Taking microbial images under a microscope as an example, the prior art mostly adopts binarization methods to obtain segmentation results.
Image segmentation is one of the important research hotspots in the field of image processing. It refers to the process of dividing an image, according to features such as gray scale, color, texture or shape, into several regions that are internally consistent and clearly distinct from one another. Image binarization is a common basic method in image segmentation; tasks such as target recognition, character recognition and edge extraction all require it. At present, threshold-based binarization selects one or more thresholds for the whole segmentation process, and a binarization result is obtained by comparing the gray value of each pixel with the threshold. However, for images with complex backgrounds or uneven illumination, binarization often cannot achieve a satisfactory segmentation effect, and in the specific field of microbial images the accuracy and operational flexibility of the segmentation results are poor.
In view of this, providing a transformer-based method and device for interactive microorganism image segmentation that improve the accuracy and operational flexibility of microbial image segmentation results is a problem to be solved by those skilled in the art.
Disclosure of Invention
Therefore, embodiments of the invention provide a transformer-based method and device for interactive microorganism image segmentation, so as to improve the accuracy and operational flexibility of microbial image segmentation results.
In order to achieve the above object, the embodiment of the present invention provides the following technical solutions:
The invention provides a transformer-based interactive microorganism image segmentation method, which comprises the following steps:
responding to a prompt instruction, acquiring an original image to be segmented, wherein the original image is provided with a prompt mark corresponding to the prompt instruction;
inputting the original image and the prompt mark into a pre-trained image segmentation model to obtain a segmentation result output by the image segmentation model;
wherein the network structure of the image segmentation model comprises:
An encoder, configured to extract multi-dimensional features from the input original image and, according to the type of the prompt mark, fuse the image features with the prompt-mark features to obtain fusion features; the types of the prompt mark comprise sparse and dense;
And a decoder, configured to decode the fusion features to obtain a segmentation mask of the target region.
In some embodiments, the prompt mark comprises a target point, a target box, prompt text, or a target-region mask on the original image.
In some embodiments, when the prompt mark is a target point, a target box, or prompt text on the original image, the type of the prompt mark is sparse;
The encoder comprises an image encoder and a mark encoder, and fusing the image features with the prompt-mark features according to the type of the prompt mark specifically comprises the following steps:
Inputting the original image into the image encoder to obtain image characteristics; inputting the prompt marks into the mark encoder to obtain mark characteristics;
And fusing the image features with the marking features to obtain the fused features.
In some embodiments, when the prompt mark is a target-region mask, the type of the prompt mark is dense;
the encoder comprises an image encoder and a convolutional network, and fusing the image features with the prompt-mark features according to the type of the prompt mark specifically comprises the following steps:
inputting the original image into the image encoder to obtain image characteristics; inputting the target area mask into the convolutional network for downsampling to obtain mask characteristics;
and fusing the image features and the mask features to obtain the fused features.
In some embodiments, the image encoder comprises:
the convolution layer is used for downsampling an input original image to obtain characteristics of multiple dimensions;
And the global attention modules are used for mapping and fusing the characteristics of each dimension to obtain image characteristics.
In some embodiments, the decoder comprises:
A cross-attention module for decoding the input image features and the marker features to obtain updated image features and updated marker features;
And the prediction module is used for receiving the updated image characteristics and the updated marking characteristics so as to obtain a predicted segmentation result.
In some embodiments, the training process of the decoder comprises:
inputting a prompt mark on an original image, and labeling a prompt on the original image according to the type of the prompt mark;
Taking the prediction mask and the labeling result obtained in the previous iteration as initial data, randomly selecting prompt points in the error regions, and taking the prediction mask of the previous iteration as the prompt of the new iteration; after a plurality of masks are obtained, taking the mask with the highest intersection-over-union as the prompt of the next iteration and sampling prompt points from it;
stopping iteration when the iteration times reach a preset value or the prediction mask reaches preset accuracy.
The invention also provides a transformer-based interactive microbial image segmentation device for implementing the above method, comprising:
The image acquisition unit is used for responding to the prompt instruction and acquiring an original image to be segmented, wherein the original image is provided with a prompt mark corresponding to the prompt instruction;
the result generation unit is used for inputting the original image and the prompt mark into a pre-trained image segmentation model so as to obtain a segmentation result output by the image segmentation model;
wherein the network structure of the image segmentation model comprises:
An encoder, configured to extract multi-dimensional features from the input original image and, according to the type of the prompt mark, fuse the image features with the prompt-mark features to obtain fusion features; the types of the prompt mark comprise sparse and dense;
And a decoder, configured to decode the fusion features to obtain a segmentation mask of the target region.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method as described above when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as described above.
According to the transformer-based interactive microorganism image segmentation method and device provided by the invention, an original image to be segmented is acquired in response to a prompt instruction, the original image carrying a prompt mark corresponding to the prompt instruction; the original image and the prompt mark are input into a pre-trained image segmentation model to obtain the segmentation result output by the model. In an actual use scenario, the prompt instruction may be a clicked point, a drawn box, a provided mask, or a piece of text describing the region to be segmented. After the corresponding prompt instruction is obtained, the image segmentation model can predict an image segmentation result meeting the prompt requirements from the original image and the prompt instruction. Compared with traditional medical image segmentation algorithms, this is more flexible and avoids the poor accuracy of binarization algorithms under heavy background noise, thereby improving the accuracy and operational flexibility of microbial image segmentation results.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It will be apparent to those of ordinary skill in the art that the drawings in the following description are exemplary only and that other implementations can be obtained from the extensions of the drawings provided without inventive effort.
The structures, proportions, sizes, etc. shown in the present specification are shown only for the purposes of illustration and description, and are not intended to limit the scope of the invention, which is defined by the claims, so that any structural modifications, changes in proportions, or adjustments of sizes, which do not affect the efficacy or the achievement of the present invention, should fall within the ambit of the technical disclosure.
FIG. 1 is a first flowchart of the transformer-based interactive microorganism image segmentation method according to the present invention;
FIG. 2 is a second flowchart of the transformer-based interactive microorganism image segmentation method according to the present invention;
FIG. 3 is a diagram of part of the encoder network in the image segmentation model according to the present invention;
FIG. 4 is a diagram of the decoder network structure in the image segmentation model according to the present invention;
FIG. 5 is a third flowchart of the transformer-based interactive microorganism image segmentation method according to the present invention;
Fig. 6 is a block diagram of the transformer-based interactive microorganism image segmentation apparatus according to the present invention;
fig. 7 is a block diagram of a computer device according to the present invention.
Detailed Description
Other aspects and advantages of the present invention will become apparent to those skilled in the art from the following detailed description, which illustrates the invention by way of certain specific embodiments, but not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
The transformer-based method and apparatus for interactive microorganism image segmentation provided by the present invention are described below with reference to figs. 1 to 6.
Referring to fig. 1, fig. 1 is a flowchart of the transformer-based interactive microorganism image segmentation method according to the present invention.
In one embodiment, the transformer-based interactive microorganism image segmentation method provided by the invention comprises the following steps:
S110: in response to a prompt instruction, acquiring an original image to be segmented, wherein the original image carries a prompt mark corresponding to the prompt instruction. The prompt mark may be a target point, a target box, prompt text, or a target-region mask on the original image. In a specific use scenario, various prompt marks may be used, such as points and boxes drawn by the user on the original image, written prompt text, or a given mask; the type of prompt mark can be chosen flexibly according to the user's habits without affecting the model's output;
S120: inputting the original image and the prompt mark into a pre-trained image segmentation model to obtain the segmentation result output by the model. It should be understood that the segmentation result is embodied on the segmented image in the form of a mask. During training of the image segmentation model, an image sample and its corresponding prompt mark are input, feature fusion is performed according to the type of the prompt mark to obtain the fusion features of the sample, and the deep learning network is trained with these fusion features; the loss is computed from the model prediction and the ground truth (e.g. the predicted mask and the real mask), and the model is iteratively optimized to obtain the final image segmentation model.
Wherein the network structure of the image segmentation model comprises:
An encoder, configured to extract multi-dimensional features from the input original image and, according to the type of the prompt mark, fuse the image features with the prompt-mark features to obtain fusion features; the types of the prompt mark comprise sparse and dense;
And a decoder, configured to decode the fusion features to obtain a segmentation mask of the target region.
In a specific use scenario, ViT parameters obtained by MAE pre-training can be used as initialization parameters. The MAE algorithm is a self-supervised method widely applied in the field of computer vision; this initialization makes training more efficient, accelerates convergence, and improves algorithm performance.
The encoder may employ sparse prompt encoding, where the prompt may be a point of interest, a box, or written text drawn on the microorganism image, and dense prompt encoding, where the prompt is a mask of interest. Sparse prompt encodings are mapped to 256 dimensions, and dense prompt encodings are mapped to the same dimensions as the image encoding.
In some embodiments, when the prompt mark is a target point, a target box, or prompt text on the original image, the type of the prompt mark is sparse. The encoder comprises an image encoder and a mark encoder; in this case, as shown in fig. 2, fusing the features according to the type of the prompt mark specifically comprises:
Inputting the original image into the image encoder to obtain image characteristics; inputting the prompt marks into the mark encoder to obtain mark characteristics;
And fusing the image features with the marking features to obtain the fused features.
Specifically, sparse prompt encoding refers to encoding with points, boxes, or text as prompt marks. When the prompt mark is a point or a box, a point is represented by its coordinates (x, y) together with a label indicating whether it belongs to the foreground or the background, and a box is represented by the coordinates of its upper-left and lower-right corners, (x1, y1) and (x2, y2). These coordinates are encoded mainly by coordinate scaling, normalization, mapping, and sin and cos operations. When the prompt mark is text, the text prompt encoding is based on the CLIP algorithm. CLIP is a multi-modal algorithm, the modalities being images and text: each has its own encoder producing a corresponding feature representation, and a contrastive learning method pulls matching image-text pairs as close as possible while pushing non-matching pairs apart. In this embodiment, the text encoder of CLIP is used to encode text prompts.
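The point/box coordinate encoding described above (scaling, normalization, mapping, then sin and cos) can be sketched in NumPy. The random Fourier mapping matrix below is an assumption for illustration, since the text does not specify the mapping:

```python
import numpy as np

def encode_sparse_points(coords, img_size=1024, dim=256, seed=0):
    # Scale pixel coordinates to [0, 1], normalize to [-1, 1], map through a
    # random matrix (an assumed choice of "mapping"), then apply sin and cos.
    rng = np.random.default_rng(seed)
    freq = rng.normal(size=(2, dim // 2))
    xy = np.asarray(coords, dtype=float) / img_size
    xy = 2.0 * xy - 1.0
    proj = 2.0 * np.pi * (xy @ freq)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

# A box is encoded via its upper-left and lower-right corners (x1, y1), (x2, y2);
# a point would additionally carry a foreground/background label.
box = [(100.0, 200.0), (400.0, 500.0)]
emb = encode_sparse_points(box)   # two 256-dimensional corner embeddings
```

Each corner maps to a 256-dimensional vector, matching the stated sparse embedding size.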
In some embodiments, when the prompt mark is a target-region mask, the type of the prompt mark is dense. The encoder comprises an image encoder and a convolutional network; in this case, fusing the features according to the type of the prompt mark specifically comprises:
inputting the original image into the image encoder to obtain image characteristics; inputting the target area mask into the convolutional network for downsampling to obtain mask characteristics;
and fusing the image features and the mask features to obtain the fused features.
Specifically, dense prompt encoding refers to encoding with a mask as the prompt mark. The mask is downsampled using convolutions of stride 2, the number of channels is then adjusted using a 1×1 convolution, and the resulting mask features are added to the image features pixel by pixel. In the case where there is no dense prompt, its encoding is represented by a learnable embedding.
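The dense path above can be sketched with a naive NumPy stride-2 convolution; the kernel values, mask size and channel count are illustrative assumptions, chosen so the mask lands on the same spatial grid as the image embedding:

```python
import numpy as np

def conv2d_stride2(x, k):
    # Naive single-channel stride-2 convolution, enough to show the shapes.
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros(((H - kh) // 2 + 1, (W - kw) // 2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[2 * i:2 * i + kh, 2 * j:2 * j + kw] * k)
    return out

rng = np.random.default_rng(0)
mask = rng.random((256, 256))                 # dense prompt: target-region mask
k = np.ones((2, 2)) / 4.0                     # stride-2 downsampling kernel
feat = conv2d_stride2(conv2d_stride2(mask, k), k)   # 256 -> 128 -> 64
w = rng.normal(size=256)                      # 1x1 conv adjusting 1 -> 256 channels
mask_feat = w[:, None, None] * feat           # (256, 64, 64)
image_feat = rng.normal(size=(256, 64, 64))   # image encoder output (C, H, W)
fused = image_feat + mask_feat                # pixel-by-pixel addition
```

Two stride-2 stages bring the mask to the embedding resolution, after which the addition is element-wise.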
Overall, as shown in fig. 2, the original image passes through the image encoder to obtain image features; a dense prompt is downsampled by convolution to obtain dense prompt features, which are added to the image features; a sparse prompt passes through the sparse encoder to obtain sparse prompt features; and these features are input to the decoder to obtain the final predicted segmentation result.
In some embodiments, the image encoder comprises:
the convolution layer is used for downsampling an input original image to obtain characteristics of multiple dimensions;
And the global attention modules are used for mapping and fusing the characteristics of each dimension to obtain image characteristics.
Specifically, the image encoder may be based on any neural network that extracts microbial image features and outputs a C×H×W image encoding. This embodiment selects ViT as the encoder: a convolution of size 16×16 with stride 16 is used for downsampling, followed by 4 global attention modules that keep the spatial size unchanged. The attention modules consist of multi-head self-attention (see fig. 3) and fuse global features from different angles. Their core is the self-attention mechanism, which first transforms the features into query, key, and value features, multiplies the query with the key, divides by the square root of the feature dimension, and obtains a score matrix after softmax; finally the score matrix is multiplied with the value features to obtain features that fuse global information. Each head fuses features from different dimensions; the outputs of all heads are concatenated and passed through a linear layer. The number of channels is then reduced to 256 using a 1×1 convolution, followed by a 3×3 convolution that keeps the number of channels unchanged, with layer normalization after each layer.
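The self-attention computation at the core of these modules can be sketched in NumPy as follows (a single head with illustrative random weights; the 16×16 patchify convolution and the multi-head concatenation are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # query, key and value all come from the same input features x
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # divide by sqrt(dim)
    return scores @ v                                 # fuse global information

rng = np.random.default_rng(0)
d = 32
x = rng.normal(size=(5, d))   # 5 patch features of dimension d
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
```

Each output row is a weighted mixture of all value rows, which is what lets every patch see global context.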
In some embodiments, the decoder comprises:
A cross-attention module for decoding the input image features and the marker features to obtain updated image features and updated marker features;
And the prediction module is used for receiving the updated image characteristics and the updated marking characteristics so as to obtain a predicted segmentation result.
Specifically, as shown in fig. 4, the inputs to the decoder are the image embedding, i.e. the features of the microorganism image obtained through the image encoder, and a number of tokens, including the prompt tokens, the output tokens, and the iou token; the decoder outputs are mask predictions and their iou scores. The prompt tokens comprise the sparse prompt tokens, the output tokens are a set of learnable embeddings used to predict masks, and the iou token is used to evaluate the masks. The left part of the decoder (see fig. 4) is composed of 4 parts:
1) Self-attention between the tokens; the calculation is the same as the attention mechanism in the image encoder, see fig. 3;
2) Cross-attention between the tokens and the image embedding, where the query comes from the tokens and the key and value come from the image embedding. This differs from the self-attention mechanism in the image encoder only in the choice of query, key and value: in the image encoder they all come from the same vector, whereas here the query comes from the tokens while the key and value come from the image embedding;
3) A point-wise MLP that updates the tokens;
4) Cross-attention between the image embedding and the tokens, similar to step 2) but with the query from the image embedding and the key and value from the tokens; i.e. the image embedding is updated with the information of the prompt tokens.
Residual connections are applied to the self-attention, the cross-attention, and the MLP. The decoder contains two layers of the structure described above; the second decoder layer uses the updated tokens and image embedding of the first layer.
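The two cross-attention directions in steps 2) and 4) differ only in which stream supplies the query; a minimal NumPy sketch (dimensions and weights are illustrative, and the internal 128-dimensional projection and multi-head split are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_src, kv_src, Wq, Wk, Wv):
    # Unlike self-attention, the query comes from one stream while the
    # key and value come from the other.
    q, k, v = query_src @ Wq, kv_src @ Wk, kv_src @ Wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
d = 64
tokens = rng.normal(size=(7, d))           # prompt + output + iou tokens
image_emb = rng.normal(size=(64 * 64, d))  # flattened 64x64 image embedding
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
# step 2): tokens attend to the image embedding, with a residual connection
tokens = tokens + cross_attention(tokens, image_emb, Wq, Wk, Wv)
# step 4): the image embedding attends back to the updated tokens
image_emb = image_emb + cross_attention(image_emb, tokens, Wq, Wk, Wv)
```

Both streams keep their shapes, so a second decoder layer can consume the updated tokens and image embedding directly.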
The right part of the decoder, which predicts based on the updated image embedding and the updated tokens, includes:
1) For the image embedding, upsampling by 4× using two transposed convolutions of stride 2; the channel dimensions of the transposed convolution outputs are 64 and 32, with the GELU activation function and layer normalization;
2) For the updated tokens, the tokens are updated once more based on cross-attention, where the query comes from the tokens and the key and value come from the image embedding, followed by two sets of MLPs. One set of MLPs outputs a new output-token vector, so that the vector matches the channel dimension of the upsampled image embedding; based on the upsampled image embedding and the updated output-token vector, a dot product is applied over the spatial dimensions to obtain the predicted mask. The other set of MLPs predicts the iou of the masks, so that each mask can be evaluated when several masks are output;
The embedding dimension used in the above transformer structure is 256. In the intermediate process, the MLP can increase the token dimension to 2048. To increase computational efficiency in the cross-attention stage, where the number of image tokens is 64×64, the dimension of the query, key and value is reduced to half of the original, i.e. 128, and 8 heads are used in the above attention layers;
To better distinguish geometric positions in the image, image position encoding is added whenever the image embedding participates in attention computation, and the original tokens (including their position encoding) are added whenever the tokens are updated.
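The spatial dot product that turns an output token into a mask can be sketched as follows (shapes follow the text: 32 channels after the 4× upsampling; the random values are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
# Image embedding after 4x transposed-convolution upsampling: 32 channels
up_emb = rng.normal(size=(32, 256, 256))
# Output token after the MLP that matches it to the 32-channel dimension
out_token = rng.normal(size=32)
# Dot product over the channel dimension at every spatial location
mask_logits = np.einsum('c,chw->hw', out_token, up_emb)
mask = mask_logits > 0   # threshold into a binary mask
```

Each pixel's logit is the inner product of its 32-channel feature with the output token, so one token yields one full-resolution mask.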
In some embodiments, the training process of the decoder comprises:
inputting a prompt mark on an original image, and labeling a prompt on the original image according to the type of the prompt mark;
Taking the prediction mask and the labeling result obtained in the previous iteration as initial data, randomly selecting prompt points in the error regions, and taking the prediction mask of the previous iteration as the prompt of the new iteration; after a plurality of masks are obtained, taking the mask with the highest intersection-over-union as the prompt of the next iteration and sampling prompt points from it;
stopping iteration when the iteration times reach a preset value or the prediction mask reaches preset accuracy.
Specifically, this embodiment uses the AdamW optimizer with beta1 set to 0.9 and beta2 set to 0.999; the learning rate first warms up linearly to 8e-4 and then decreases step-wise. To prevent the algorithm from overfitting the data, it is regularized using weight decay and dropout strategies, where the weight decay parameter is set to 0.1 and the dropout ratio to 0.4. During algorithm optimization, when there is only one prompt, the prompt is usually ambiguous and several masks may satisfy it. To solve this problem, when there is only one prompt, multiple masks are predicted using multiple output tokens, typically 3; these 3 masks form a nested structure covering the whole, a part, and a sub-part, i.e. a whole microorganism, a partial microorganism, and a smaller microbial structure. To evaluate each mask, this application adds an extra token and a lightweight head, i.e. the iou token mentioned in the decoder, to predict a confidence score for each mask; the confidence score is computed as the predicted iou between the mask and the ground truth.
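The warmup-then-decay schedule can be sketched as a small function; only the 8e-4 peak is stated in the text, so the warmup length, decay milestone and decay factor below are placeholder assumptions:

```python
def lr_at(step, warmup_steps=250, base_lr=8e-4, decay_steps=(60000,), decay=0.1):
    # Linear warmup from 0 up to base_lr, then multiply by `decay`
    # at each milestone in decay_steps (milestones are assumptions).
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    lr = base_lr
    for milestone in decay_steps:
        if step >= milestone:
            lr *= decay
    return lr
```

Such a function would typically be attached to the AdamW optimizer as a per-step scheduler.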
When there are multiple prompts, ambiguity rarely occurs, so the three predicted masks would be similar. To reduce the amount of computation in training and ensure that a single unambiguous mask can be updated effectively, only one mask is predicted when there is more than one prompt; for this purpose a 4th output token is added, which predicts the mask when there are multiple prompts.
During model training, considering the imbalance between foreground and background and between hard and easy samples in microbial image segmentation, focal loss and Dice loss are chosen as the loss functions for the prediction mask, and an MSE loss function is used for the iou prediction. The 3 loss functions are weighted in a 20:1:1 ratio, and when multiple masks are predicted only the mask with the smallest loss is considered.
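A minimal NumPy sketch of this 20:1:1 weighted loss follows; the focal gamma and the smoothing constants are assumptions, since the text only fixes the loss types and the ratio:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    # p: predicted foreground probabilities, y: binary ground-truth mask
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pt = np.where(y == 1, p, 1 - p)   # probability assigned to the true class
    return float(np.mean(-((1 - pt) ** gamma) * np.log(pt)))

def dice_loss(p, y, eps=1e-7):
    inter = np.sum(p * y)
    return float(1 - (2 * inter + eps) / (np.sum(p) + np.sum(y) + eps))

def total_loss(p, y, pred_iou, true_iou):
    # focal : dice : mse weighted 20 : 1 : 1, as stated in the text
    mse = float((pred_iou - true_iou) ** 2)
    return 20 * focal_loss(p, y) + dice_loss(p, y) + mse
```

Focal loss down-weights easy pixels, Dice loss counters foreground/background imbalance, and the MSE term supervises the iou head.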
For computational efficiency, each microorganism picture passes through the encoder only once, while the decoder is trained iteratively. The iterative training process is as follows:
First, points are randomly selected on the labeled mask, and boxes are the bounding boxes of the labeled mask. For robustness at test time, a small amount of noise is added randomly; the degree of noise is limited to 10%, with an offset of at most 20 pixels.
After the first prediction step, combining the predicted mask with the labeling result, points are randomly selected in the error regions; these points may be foreground or background, i.e. false-negative or false-positive points. The mask prediction of the previous iteration is used as the prompt of the new round; to give the next iteration as much information as possible, unthresholded mask logits are used instead of a binary mask. After multiple masks are obtained, the mask with the highest iou value is used for sampling points in the next iteration.
The above procedure is iterated 8 times during training and 16 times during testing. To let the model benefit from the mask of the previous iteration's prediction, two additional iterations are set in which no extra points are sampled: one is randomly inserted among the 8 point-sampling iterations, and the other is always placed at the end.
In summary, there are 11 iterations in total: 1 iteration uses the initial prompt embedding, 8 iterations use points sampled based on the previous round of prediction as prompt tokens, and 2 iterations learn to correct the mask of the previous round of prediction. Since the decoder is lightweight, multiple iterations can be performed without affecting efficiency.
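The error-region point sampling at the heart of this loop can be sketched as follows (the model call is stubbed out; in real training the mask would be re-predicted each round from the accumulated prompts and the previous round's unthresholded logits):

```python
import numpy as np

def sample_error_point(pred_mask, gt_mask, rng):
    # Pick a random pixel where prediction and ground truth disagree:
    # a false negative yields a foreground click (label 1), a false
    # positive yields a background click (label 0).
    errors = np.argwhere(pred_mask != gt_mask)
    if len(errors) == 0:
        return None
    y, x = errors[rng.integers(len(errors))]
    return (x, y, 1 if gt_mask[y, x] else 0)

rng = np.random.default_rng(0)
gt = np.zeros((16, 16), dtype=bool); gt[4:12, 4:12] = True     # labeled mask
pred = np.zeros((16, 16), dtype=bool); pred[4:12, 4:8] = True  # partial prediction
prompts = []
for _ in range(8):   # the 8 point-sampling iterations
    pt = sample_error_point(pred, gt, rng)
    if pt is None:
        break
    prompts.append(pt)
    # a real model would re-predict the mask here from the accumulated
    # prompts plus the previous round's mask logits
```

Here the prediction misses the right half of the object, so every sampled click lands in the false-negative region as a foreground point.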
In order to facilitate understanding, a specific usage scenario is taken as an example, and the implementation procedure of the method provided by the present invention is briefly described below.
As shown in fig. 5, in an actual medical application scenario, the above segmentation algorithm can be applied to various medical fields, such as CT images, images under a microscope, and the like. This patent takes a microbial image under a microscope and a box prompt as an example: a doctor draws a box of interest on the image, the box and the image are input into the algorithm, and the algorithm returns a mask image of the microbial segmentation within the box. Besides this example, the doctor can also draw a point, provide a mask as the region of interest, or describe the region of interest in text; inputting these prompts and the original image into the algorithm likewise yields a mask segmenting the region of interest.
In the above specific embodiment, according to the transformer-based interactive microorganism image segmentation method provided by the invention, an original image to be segmented is acquired in response to a prompt instruction, the original image carrying a prompt mark corresponding to the prompt instruction; the original image and the prompt mark are input into a pre-trained image segmentation model to obtain the segmentation result output by the model. In an actual use scenario, the prompt instruction may be a clicked point, a drawn box, a provided mask, or a piece of text describing the region to be segmented. After the corresponding prompt instruction is obtained, the image segmentation model can predict an image segmentation result meeting the prompt requirements from the original image and the prompt instruction. Compared with traditional medical image segmentation algorithms, this is more flexible and avoids the poor accuracy of binarization algorithms under heavy background noise, thereby improving the accuracy and operational flexibility of microbial image segmentation results.
In addition to the above method, the present invention also provides a transformer-based interactive microbial image segmentation apparatus for implementing the method described above. As shown in fig. 6, the apparatus comprises:
The image acquisition unit 610 is configured to respond to a prompt instruction, and acquire an original image to be segmented, where the original image has a prompt mark corresponding to the prompt instruction;
a result generating unit 620, configured to input the original image and the prompt mark into a pre-trained image segmentation model, so as to obtain a segmentation result output by the image segmentation model;
wherein the network structure of the image segmentation model comprises:
The encoder is used for extracting the characteristics of the input original image in multiple dimensions, combining the types of the prompt marks and fusing the image characteristics and the prompt mark characteristics to obtain fusion characteristics; the types of the prompt marks comprise sparse type and dense type;
And the decoder is used for decoding the fusion characteristic to obtain a segmentation mask of the target region.
In some embodiments, the prompt marks include a target point, a target box, prompt text, or a target region mask on the original image.
In some embodiments, in the case that the prompt mark is a target point, a target box, or prompt text on the original image, the type of the prompt mark is sparse;
The encoder comprises an image encoder and a mark encoder; fusing the extracted multi-dimensional features in combination with the type of the prompt mark specifically comprises the following steps:
Inputting the original image into the image encoder to obtain image characteristics; inputting the prompt marks into the mark encoder to obtain mark characteristics;
And fusing the image features with the marking features to obtain the fused features.
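The sparse branch can be illustrated with a small numpy sketch: each point or box prompt is turned into one embedding token via a sinusoidal positional encoding, ready to be fused with the image features. The embedding scheme and the function names (`positional_embedding`, `encode_sparse_prompts`) are illustrative assumptions, not the patented mark encoder, and text prompts (which would require a language model) are omitted.

```python
import numpy as np

def positional_embedding(coords, dim=8):
    """Map normalized (x, y) coordinates to a sin/cos positional embedding."""
    freqs = 2.0 ** np.arange(dim // 4)            # frequency bands
    angles = coords[..., None] * freqs * np.pi    # (..., 2, dim/4)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(*coords.shape[:-1], -1)    # (..., dim)

def encode_sparse_prompts(points, boxes, dim=8):
    """Encode sparse prompts (points and boxes) as one embedding token each."""
    tokens = []
    for p in points:                               # p: (x, y) in [0, 1]
        tokens.append(positional_embedding(np.asarray(p, dtype=float), dim))
    for b in boxes:                                # b: (x1, y1, x2, y2) in [0, 1]
        corners = np.asarray(b, dtype=float).reshape(2, 2)
        tokens.append(positional_embedding(corners, dim).mean(axis=0))  # pool corners
    return np.stack(tokens)                        # (num_prompts, dim)
```

The resulting token matrix would then be concatenated with the image features produced by the image encoder to form the fused features.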
In some embodiments, where the prompt mark is a target region mask, the type of the prompt mark is dense;
The encoder comprises an image encoder and a convolution network; fusing the extracted multi-dimensional features in combination with the type of the prompt mark specifically comprises the following steps:
inputting the original image into the image encoder to obtain image characteristics; inputting the target area mask into the convolutional network for downsampling to obtain mask characteristics;
and fusing the image features and the mask features to obtain the fused features.
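The dense branch can be sketched as follows, with strided average pooling standing in for the learned strided convolutions: the full-resolution mask is downsampled to the spatial resolution of the image features and fused element-wise. The function names and the pooling stand-in are assumptions for illustration only.

```python
import numpy as np

def downsample_mask(mask, factor):
    """Average-pool a mask by `factor` (a stand-in for a strided convolution)."""
    h, w = mask.shape
    m = mask[:h - h % factor, :w - w % factor]  # crop to a multiple of factor
    return m.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def fuse_dense(image_features, mask):
    """Fuse image features (H, W, C) with a full-resolution mask, element-wise."""
    factor = mask.shape[0] // image_features.shape[0]
    dense = downsample_mask(mask, factor)        # (H, W) mask features
    return image_features + dense[..., None]     # broadcast over channels
```

For example, a 64x64 mask fused with 16x16x8 image features is pooled by a factor of 4 before the element-wise addition.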
In some embodiments, the image encoder comprises:
the convolution layer is used for downsampling an input original image to obtain characteristics of multiple dimensions;
And the global attention modules are used for mapping and fusing the characteristics of each dimension to obtain image characteristics.
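The two-stage image encoder above (convolutional downsampling followed by global attention over the flattened feature map) can be sketched with numpy; average pooling again stands in for the learned convolution layer, and the single-head attention is a deliberate simplification of the global attention modules.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def downsample(x, factor=2):
    """Strided average pooling, standing in for the strided convolution layer."""
    h, w, c = x.shape
    return x.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def global_attention(tokens, wq, wk, wv):
    """One single-head global self-attention module over flattened image tokens."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

# toy pipeline: 32x32 image with 4 channels -> downsample twice -> attend globally
img = rng.standard_normal((32, 32, 4))
feats = downsample(downsample(img))           # (8, 8, 4) multi-dimensional features
tokens = feats.reshape(-1, feats.shape[-1])   # 64 tokens of dimension 4
wq, wk, wv = (rng.standard_normal((4, 4)) for _ in range(3))
image_features = global_attention(tokens, wq, wk, wv)  # (64, 4)
```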
In some embodiments, the decoder comprises:
A cross-attention module for decoding the input image features and the marker features to obtain updated image features and updated marker features;
And the prediction module is used for receiving the updated image characteristics and the updated marking characteristics so as to obtain a predicted segmentation result.
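The cross-attention decoding step can be sketched as a two-way exchange: the prompt-mark tokens attend to the image tokens, then the image tokens attend back to the updated marks, yielding the updated features the prediction module consumes. Shared projection weights and single-head attention are simplifying assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, wq, wk, wv):
    """Queries attend to the context sequence; residual update of the queries."""
    q, k, v = queries @ wq, context @ wk, context @ wv
    return queries + softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

d = 4
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
image_tokens = rng.standard_normal((64, d))  # flattened image features
mark_tokens = rng.standard_normal((2, d))    # prompt-mark features

# two-way decoding: marks attend to the image, then the image attends back
updated_marks = cross_attention(mark_tokens, image_tokens, wq, wk, wv)
updated_image = cross_attention(image_tokens, updated_marks, wq, wk, wv)
```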
In some embodiments, the training process of the decoder comprises:
inputting a prompt mark on an original image, and labeling a prompt on the original image according to the type of the prompt mark;
randomly selecting prompt points in the error region, taking the prediction mask obtained in the previous iteration and the labeling result as initial data, and taking the prediction mask of the previous iteration as the prompt for the new iteration; after a plurality of masks are obtained, taking the mask with the highest intersection-over-union (IoU) as the prompt for the next iteration and sampling the prompt points;
stopping iteration when the number of iterations reaches a preset value or the prediction mask reaches a preset accuracy.
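The two training ingredients above, sampling a prompt point inside the error region and keeping the candidate mask with the highest IoU as the next iteration's prompt, can be sketched as follows. The helper names are assumptions for the example.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union between two binary masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

def sample_error_point(pred, gt, rng):
    """Randomly pick a prompt point in the error region between the previous
    prediction and the labeling result. Label 1 marks missed foreground,
    label 0 marks a false positive. Returns None if the masks agree."""
    errors = np.argwhere(pred != gt)
    if len(errors) == 0:
        return None
    r, c = errors[rng.integers(len(errors))]
    return (int(r), int(c), int(gt[r, c]))

def best_prompt_mask(masks, gt):
    """Of several candidate masks, keep the one with the highest IoU
    as the prompt for the next iteration."""
    return max(masks, key=lambda m: iou(m, gt))
```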
In the above specific embodiment, the transformer-based interactive microorganism image segmentation apparatus provided by the invention implements the same workflow as the method described above: an original image carrying a prompt mark is acquired in response to a prompt instruction, and both are input into the pre-trained image segmentation model to obtain the segmentation result. The apparatus therefore offers the same gains in accuracy and operational flexibility over traditional medical image segmentation algorithms.
In one embodiment, a computer device is provided, which may be a server whose internal structure may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and the computer programs in the non-volatile storage medium. The database of the computer device is used to store static and dynamic data. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements the steps of the above method embodiments.
It will be appreciated by those skilled in the art that the structure shown in FIG. 7 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
Corresponding to the above embodiments, the present invention further provides a computer storage medium containing one or more program instructions. The one or more program instructions, when executed, carry out the method described above.
The present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium; the computer program, when executed by a processor, performs the above method.
In the embodiment of the invention, the processor may be an integrated circuit chip with signal processing capability. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The methods, steps, and logic blocks disclosed in the embodiments of the present invention may be implemented or performed by such a processor. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly in a hardware decoding processor, or in a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media well known in the art. The processor reads the information in the storage medium and, in combination with its hardware, performs the steps of the above method.
The storage medium may be memory, for example, may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory.
The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory.
The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct Rambus RAM (DRRAM).
The storage media described in embodiments of the present invention are intended to comprise, without being limited to, these and any other suitable types of memory.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the present invention may be implemented in hardware, in software, or in a combination of the two. When implemented in software, the corresponding functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on the computer-readable medium. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
The foregoing detailed description of the invention has been presented for purposes of illustration and description, and it should be understood that the foregoing is by way of illustration and description only, and is not intended to limit the scope of the invention.

Claims (10)

1. A method for interactive microbial image segmentation based on a transformer, the method comprising:
responding to a prompt instruction, acquiring an original image to be segmented, wherein the original image is provided with a prompt mark corresponding to the prompt instruction;
inputting the original image and the prompt mark into a pre-trained image segmentation model to obtain a segmentation result output by the image segmentation model;
wherein the network structure of the image segmentation model comprises:
The encoder is used for extracting the characteristics of the input original image in multiple dimensions, combining the types of the prompt marks and fusing the image characteristics and the prompt mark characteristics to obtain fusion characteristics; the types of the prompt marks comprise sparse type and dense type;
And the decoder is used for decoding the fusion characteristic to obtain a segmentation mask of the target region.
2. The method of claim 1, wherein the prompt marks comprise target points, target boxes, prompt text, or target area masks on the original image.
3. The transformer-based interactive microbial image segmentation method according to claim 2, wherein in the case that the prompt mark is a target point, a target box, or prompt text on the original image, the type of the prompt mark is sparse;
The encoder comprises an image encoder and a mark encoder, combines the type of the prompt mark, fuses the image characteristic and the prompt mark characteristic, and specifically comprises the following steps:
Inputting the original image into the image encoder to obtain image characteristics; inputting the prompt marks into the mark encoder to obtain mark characteristics;
And fusing the image features with the marking features to obtain the fused features.
4. The transformer-based interactive microbial image segmentation method according to claim 2, wherein the type of the prompt mark is dense in the case that the prompt mark is a target area mask;
the encoder comprises an image encoder and a convolution network, combines the type of the prompt mark, fuses the image characteristic and the prompt mark characteristic, and specifically comprises the following steps:
inputting the original image into the image encoder to obtain image characteristics; inputting the target area mask into the convolutional network for downsampling to obtain mask characteristics;
and fusing the image features and the mask features to obtain the fused features.
5. The method of claim 3 or 4, wherein the image encoder comprises:
the convolution layer is used for downsampling an input original image to obtain characteristics of multiple dimensions;
And the global attention modules are used for mapping and fusing the characteristics of each dimension to obtain image characteristics.
6. The method of claim 1, wherein the decoder comprises:
A cross-attention module for decoding the input image features and the marker features to obtain updated image features and updated marker features;
And the prediction module is used for receiving the updated image characteristics and the updated marking characteristics so as to obtain a predicted segmentation result.
7. The method of claim 6, wherein the training process of the decoder comprises:
inputting a prompt mark on an original image, and labeling a prompt on the original image according to the type of the prompt mark;
randomly selecting prompt points in the error region, taking the prediction mask obtained in the previous iteration and the labeling result as initial data, and taking the prediction mask of the previous iteration as the prompt for the new iteration; after a plurality of masks are obtained, taking the mask with the highest intersection-over-union (IoU) as the prompt for the next iteration and sampling the prompt points;
stopping iteration when the number of iterations reaches a preset value or the prediction mask reaches a preset accuracy.
8. A transformer-based interactive microbial image segmentation device for implementing the method according to any one of claims 1-7, characterized in that the device comprises:
The image acquisition unit is used for responding to the prompt instruction and acquiring an original image to be segmented, wherein the original image is provided with a prompt mark corresponding to the prompt instruction;
the result generation unit is used for inputting the original image and the prompt mark into a pre-trained image segmentation model so as to obtain a segmentation result output by the image segmentation model;
wherein the network structure of the image segmentation model comprises:
The encoder is used for extracting the characteristics of the input original image in multiple dimensions, combining the types of the prompt marks and fusing the image characteristics and the prompt mark characteristics to obtain fusion characteristics; the types of the prompt marks comprise sparse type and dense type;
And the decoder is used for decoding the fusion characteristic to obtain a segmentation mask of the target region.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1-7 when the program is executed.
10. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-7.
CN202410308761.7A 2024-03-19 2024-03-19 Method and device for segmenting interactive microorganism image based on transformer Pending CN117911430A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410308761.7A CN117911430A (en) 2024-03-19 2024-03-19 Method and device for segmenting interactive microorganism image based on transformer


Publications (1)

Publication Number Publication Date
CN117911430A true CN117911430A (en) 2024-04-19

Family

ID=90685406



Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220036124A1 (en) * 2020-07-31 2022-02-03 Sensetime Group Limited Image processing method and device, and computer-readable storage medium
CN116958163A (en) * 2023-09-20 2023-10-27 海杰亚(北京)医疗器械有限公司 Multi-organ and/or focus medical image segmentation method and device
CN117671138A (en) * 2023-11-28 2024-03-08 山东大学 Digital twin modeling method and system based on SAM large model and NeRF
CN117671688A (en) * 2023-12-07 2024-03-08 北京智源人工智能研究院 Segmentation recognition and text description method and system based on hintable segmentation model



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination