CN117788981A - Zero sample image segmentation model training method and device based on multiple modes - Google Patents

Zero sample image segmentation model training method and device based on multiple modes

Info

Publication number
CN117788981A
Authority
CN
China
Prior art keywords
image
module
segmentation
processing
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410161421.6A
Other languages
Chinese (zh)
Inventor
石雅洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xumi Yuntu Space Technology Co Ltd
Original Assignee
Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xumi Yuntu Space Technology Co Ltd filed Critical Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority to CN202410161421.6A priority Critical patent/CN117788981A/en
Publication of CN117788981A publication Critical patent/CN117788981A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to the technical field of computer vision processing, and provides a zero sample image segmentation model training method and device based on multiple modes. The method comprises the following steps: training a basic segmentation module that uses an image encoder, a prompt word encoder and a mask decoder to obtain a trained basic segmentation module; training a segmentation fine adjustment module that uses a language model, a text encoder and a saliency detection module to obtain a trained segmentation fine adjustment module; splicing the trained basic segmentation module with the trained segmentation fine adjustment module, and connecting a cross attention fusion module and a preset decoder to obtain a zero sample image segmentation model; and training the zero sample image segmentation model again with the training set to obtain a trained zero sample image segmentation model. The method and the device address the low accuracy of segmentation results for unseen object types and small objects in the prior art.

Description

Zero sample image segmentation model training method and device based on multiple modes
Technical Field
The application relates to the technical field of computer vision processing, in particular to a zero sample image segmentation model training method and device based on multiple modes.
Background
Image segmentation techniques are widely used in computer vision and image processing, for example to separate different targets or objects in an image from the background, to classify different areas in an image, or to support behavioral analysis in a monitoring system. The training stage of an image segmentation model usually requires exposure to sample data of all possible target categories so that pixels in an image can be accurately assigned to different categories.
In practical applications, however, new target classes are often encountered for which no sample data is available for training. Zero-sample segmentation algorithms were therefore introduced to handle the zero-sample setting in segmentation tasks: a model is learned from sample data of known classes together with auxiliary information related to the target classes (such as semantic descriptions and attribute features), pixels in an image are matched against the known classes, and the auxiliary information is used for reasoning so that pixels of unknown classes in the image can be segmented. Precisely because the segmentation model never sees samples of the relevant data during training, a zero sample image segmentation model lacks sufficient understanding of the features and shapes of new objects or small objects, and its segmentation output for images containing unseen object types and small objects is not accurate enough.
Therefore, the prior art suffers from low segmentation accuracy for unseen object types.
Disclosure of Invention
In view of this, embodiments of the present application provide a zero sample image segmentation model training method and device based on multiple modes, so as to solve the problem in the prior art that the accuracy of segmentation results for unseen object types and small objects is low.
In a first aspect of an embodiment of the present application, a zero sample image segmentation model training method based on multiple modes is provided, including: acquiring images in an image training set and prompt words in a prompt word training set, wherein the prompt words are determined based on the content of the images; inputting the image and the prompt word into a basic segmentation module, processing the image by using an image encoder to obtain image coding features, and processing the prompt word by using a prompt word encoder to obtain the prompt word coding features; processing the image coding features and the prompt word coding features through a mask decoder to obtain corresponding masks, processing the masks through a full-connection layer, calculating a first loss function, and training the basic segmentation module based on the first loss function to obtain a trained basic segmentation module; inputting the image into a segmentation fine adjustment module, performing cross-modal processing and saliency detection processing on the image respectively to obtain text embedded features and a salient feature map, and performing cross-attention fusion on the text embedded features and the salient feature map to obtain a first fusion feature map; processing the first fusion feature map through a multi-layer perceptron and a full-connection layer, calculating a second loss function, and training the segmentation fine adjustment module based on the second loss function to obtain a trained segmentation fine adjustment module; splicing the trained basic segmentation module and the trained segmentation fine adjustment module, and connecting the cross attention fusion module and a preset decoder to obtain a zero sample image segmentation model; inputting the images and the prompt words into the zero sample image segmentation model for processing, calculating a third loss function, and training the zero sample image segmentation model based on the third loss function to obtain a trained zero sample image segmentation model.
In a second aspect of the embodiments of the present application, a zero sample image segmentation model training device based on multiple modes is provided, including: the acquisition module is configured to acquire images in the image training set and prompt words in the prompt word training set, and the prompt words are determined based on the content of the images; the first processing module is configured to input the image and the prompt word into the basic segmentation module, process the image by using the image encoder to obtain image coding characteristics, and process the prompt word by using the prompt word encoder to obtain the prompt word coding characteristics; the first training module is configured to process the image coding features and the prompt word coding features through a mask decoder to obtain corresponding masks, process the masks through a full-connection layer, calculate a first loss function, train the basic segmentation module based on the first loss function, and obtain a trained basic segmentation module; the second processing module is configured to input the image into the segmentation fine adjustment module, perform cross-modal processing and saliency detection processing on the image respectively to obtain text embedded features and a salient feature map, and perform cross-attention fusion on the text embedded features and the salient feature map to obtain a first fusion feature map; the second training module is configured to process the first fusion feature map through the multi-layer perceptron and the full-connection layer, calculate a second loss function, train the segmentation fine adjustment module based on the second loss function, and obtain a trained segmentation fine adjustment module; the third processing module is configured to splice the trained basic segmentation module and the trained segmentation fine adjustment module, and is connected with the cross attention fusion module and the preset decoder to obtain a zero sample image segmentation model; the third training module is configured to input the images and the prompt words into the zero sample image segmentation model for processing, calculate a third loss function, and train the zero sample image segmentation model based on the third loss function to obtain a trained zero sample image segmentation model.
In a third aspect of the embodiments of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the embodiments of the present application, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
At least one of the technical solutions adopted in the embodiments of the present application can achieve the following beneficial effects:
By constructing a basic segmentation module and a segmentation fine adjustment module, the basic segmentation module is trained with the image training set and the prompt word training set to obtain a trained basic segmentation module, and the segmentation fine adjustment module is trained with the image training set to obtain a trained segmentation fine adjustment module. The trained basic segmentation module and the trained segmentation fine adjustment module are spliced and connected with a cross attention fusion module and a preset decoder to obtain a zero sample image segmentation model, which is then trained with the image training set and the prompt word training set to obtain a trained zero sample image segmentation model. An adapter module is configured in the image encoder of the basic segmentation module and is used to update the parameters of the image encoder, so that training the image encoder through the parameter updates of the adapter module yields an image encoder with stronger semantic understanding, while joint training on images and prompt words provides a multi-modal contrast effect that improves model performance. The trained zero sample image segmentation model can therefore produce accurate segmentation results for unseen object types and small objects.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments or the description of the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
Fig. 1 is a schematic flow chart of a multi-mode-based zero sample image segmentation model training method according to an embodiment of the present application;
FIG. 2 is a flowchart of another multi-mode-based zero-sample image segmentation model training method according to an embodiment of the present application;
FIG. 3 is a flowchart of another multi-mode-based zero-sample image segmentation model training method according to an embodiment of the present application;
FIG. 4 is a flowchart of another multi-mode-based zero sample image segmentation model training method according to an embodiment of the present application;
FIG. 5 is a flowchart of another method for training a zero sample image segmentation model based on multiple modes according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a multi-mode-based zero sample image segmentation model training device according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged, as appropriate, such that embodiments of the present application may be implemented in sequences other than those illustrated or described herein, and that the objects identified by "first," "second," etc. are generally of a type and not limited to the number of objects, e.g., the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/", generally means that the associated object is an "or" relationship.
Furthermore, it should be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises that element.
A multi-mode zero-sample image segmentation model training method and apparatus according to embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a multi-mode-based zero sample image segmentation model training method according to an embodiment of the present application. As shown in fig. 1, the multi-mode-based zero sample image segmentation model training method includes:
s101, acquiring images in an image training set and prompt words in a prompt word training set, wherein the prompt words are determined based on the content of the images;
s102, inputting an image and a prompt word into a basic segmentation module, processing the image by using an image encoder to obtain image coding features, and processing the prompt word by using a prompt word encoder to obtain the prompt word coding features;
S103, processing the image coding features and the prompt word coding features through a mask decoder to obtain corresponding masks, processing the masks through a full connection layer, calculating a first loss function, and training the basic segmentation module based on the first loss function to obtain a trained basic segmentation module;
s104, inputting the image into a segmentation fine adjustment module, respectively performing cross-mode processing and saliency detection processing on the image to obtain text embedded features and a salient feature map, and performing cross-attention fusion on the text embedded features and the salient feature map to obtain a first fusion feature map;
s105, processing the first fusion feature map through a multi-layer perceptron and a full-connection layer, calculating a second loss function, and training the segmentation fine adjustment module based on the second loss function to obtain a trained segmentation fine adjustment module;
s106, splicing the trained basic segmentation module and the trained segmentation fine adjustment module, and connecting the cross attention fusion module and a preset decoder to obtain a zero sample image segmentation model;
s107, inputting the images and the prompt words into the zero sample image segmentation model for processing, calculating a third loss function, and training the zero sample image segmentation model based on the third loss function to obtain a trained zero sample image segmentation model.
In the training process of the zero sample image segmentation model, the basic segmentation module can be trained with the image training set and the prompt word training set, and the segmentation fine adjustment module can be trained with the image training set, yielding a trained basic segmentation module and a trained segmentation fine adjustment module. The two modules are spliced and connected with the cross attention fusion module and the preset decoder to obtain the complete zero sample image segmentation model, which can then be trained with the image training set and the prompt word training set to obtain the trained zero sample image segmentation model. The trained model can be used to segment images to be segmented, and can accurately segment unseen object types and small objects during processing.
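For orientation, a minimal PyTorch-style sketch of this three-stage schedule is shown below. It is illustrative only: the module objects, data loaders and loss helpers are hypothetical placeholders supplied by the caller, not components recited in the embodiments.

```python
# Hypothetical sketch of the three-stage training schedule described above.
# All modules, loaders and loss functions are assumed to be provided by the caller.
import torch

def train_three_stages(base_module, finetune_module, full_model,
                       image_prompt_loader, image_loader,
                       loss_fn_1, loss_fn_2, loss_fn_3,
                       epochs=10, lr=1e-4):
    # Stage 1: train the basic segmentation module on (image, prompt word) pairs.
    opt1 = torch.optim.AdamW(base_module.parameters(), lr=lr)
    for _ in range(epochs):
        for image, prompt, gt_mask in image_prompt_loader:
            pred_mask, iou_pred = base_module(image, prompt)
            loss = loss_fn_1(pred_mask, iou_pred, gt_mask)   # first loss function
            opt1.zero_grad()
            loss.backward()
            opt1.step()

    # Stage 2: train the segmentation fine adjustment module on images alone.
    opt2 = torch.optim.AdamW(finetune_module.parameters(), lr=lr)
    for _ in range(epochs):
        for image, match_label in image_loader:
            match_logit = finetune_module(image)
            loss = loss_fn_2(match_logit, match_label)       # second loss function
            opt2.zero_grad()
            loss.backward()
            opt2.step()

    # Stage 3: train the spliced zero sample image segmentation model end to end.
    opt3 = torch.optim.AdamW(full_model.parameters(), lr=lr)
    for _ in range(epochs):
        for image, prompt, gt_mask in image_prompt_loader:
            pred_mask, iou_pred = full_model(image, prompt)
            loss = loss_fn_3(pred_mask, iou_pred, gt_mask)   # third loss function
            opt3.zero_grad()
            loss.backward()
            opt3.step()
    return full_model
```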
In the training process of the basic segmentation module, the image encoder of a pre-trained segmentation model can be adopted directly and equipped with an adapter module. During training, all parameters of the original structure of the image encoder are frozen, and the adapter module serves as the module whose parameters are updated while training the image encoder.
In some embodiments, processing the image with the image encoder to obtain image coding features includes: processing the image with a plurality of Transformer modules to obtain the image coding features, where each Transformer module comprises a first normalization layer, a multi-head attention layer, an adapter module, a second normalization layer and a multi-layer perceptron; the output of the multi-head attention layer is processed by the adapter module to obtain the input of the second normalization layer, and the adapter module is configured as the parameter updating module of the image encoder.
In particular, the image encoder may be assembled from a plurality of Transformer modules, and the number of Transformer modules is not particularly limited here. During processing by the image encoder, the first normalization layer, the multi-head attention layer, the adapter module, the second normalization layer and the multi-layer perceptron of each module process the input of the image encoder in sequence to obtain the image coding features. It will be appreciated that each Transformer module may include a Layer Norm, Multi-head Attention, a second Layer Norm and an MLP, with the adapter module inserted behind the Multi-head Attention; the output of the multi-head attention layer is therefore processed as the input of the adapter module, whose output becomes the input of the second normalization layer.
It should be noted that the original structure of each Transformer module may include a first normalization layer, a multi-head attention layer, a second normalization layer and a multi-layer perceptron.
The structures and the processing effects of the first normalization layer and the second normalization layer can be consistent.
In the scheme provided by this embodiment, the image encoder is constructed from a combination of Transformer modules. By stacking multiple Transformer modules, the image encoder can progressively extract rich feature representations and better understand the image content, so that the image coding features can be used in subsequent image segmentation tasks. The configured adapter module serves as the parameter updating module during training of the image encoder, which further enhances the expressive power and adaptability of the image encoder and improves the training effect and generalization ability of the zero sample image segmentation model constructed later, thereby improving model performance.
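A minimal PyTorch sketch of one such Transformer module with the adapter inserted after the multi-head attention layer is shown below. The pre-norm residual layout, the feature dimensions and the AdapterModule placeholder (sketched further below) are assumptions rather than the recited structure.

```python
# Illustrative Transformer block with an adapter after the multi-head attention layer.
# Residual connections and dimensions are assumptions, not the patented structure.
import torch
from torch import nn

class TransformerBlockWithAdapter(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4, adapter=None):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                        # first normalization layer
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.adapter = adapter or nn.Identity()               # parameter updating module
        self.norm2 = nn.LayerNorm(dim)                        # second normalization layer
        self.mlp = nn.Sequential(                             # multi-layer perceptron
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):                                     # x: (B, N, dim) tokens
        h = self.norm1(x)
        h, _ = self.attn(h, h, h)
        h = self.adapter(h)           # adapter output feeds the second normalization path
        x = x + h                     # residual layout is an assumption
        x = x + self.mlp(self.norm2(x))
        return x
```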
In other embodiments, processing the output of the multi-head attention layer with the adapter module to obtain the input of the second normalization layer includes: in the adapter module, processing the output of the multi-head attention layer sequentially with a first linear layer, a compression linear layer, an activation layer, a restoration linear layer and a second linear layer to obtain the input of the second normalization layer; multiplying the output of the first linear layer, after processing by a first activation function, with the output of the restoration linear layer to obtain the input of the second linear layer; and splicing the result of global average pooling of the adapter module's input with the output of the second linear layer to obtain the output of the adapter module, which is taken as the input of the second normalization layer.
In particular, each adapter module may include a first linear layer, a compression linear layer, an activation layer, a restoration linear layer and a second linear layer.
Fig. 2 is a flowchart of another multi-mode zero-sample image segmentation model training method according to an embodiment of the present application, and a processing procedure of the adapter module is described below with reference to fig. 2.
In the adapter module, the input can be processed sequentially by the first linear layer, the compression linear layer, the activation layer, the restoration linear layer and the second linear layer to obtain the input of the second normalization layer. The first linear layer and the second linear layer may use 1×1 convolutions. The compression linear layer may use a 3×3 convolution that maps the output of the first linear layer to a higher dimension for richer feature learning. The activation layer may apply a ReLU activation function to the output of the compression linear layer, increasing the expressive power of the features. The restoration linear layer may also use a 3×3 convolution, applying another linear transformation to the activated output and mapping it back to the original feature space, restoring the original dimension so that it can subsequently be fused with the backbone network, where the backbone network refers to the backbone of the Transformer module.
The output of the first linear layer may be processed by a first activation function (e.g., a sigmoid function) to obtain channel-dimension weights for that output, and these weights may be multiplied with the output of the restoration linear layer to form the input of the second linear layer. Further, the result of global average pooling of the adapter module's input may be spliced with the output of the second linear layer to obtain the output of the adapter module. Global average pooling helps the adapter module capture global features, and splicing its result with the output of the second linear layer yields a richer and more comprehensive feature representation. This helps optimize the image encoder and further improves the expressive power and adaptability of the basic segmentation module, thereby helping the zero sample image segmentation model better understand the input data, i.e., the image to be segmented.
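The following sketch illustrates one possible reading of the adapter described above. It assumes the token features have been reshaped to a spatial grid, an arbitrary expansion ratio for the compression linear layer, a per-channel sigmoid gate, and that the "splicing" of the pooled input with the second linear layer's output is a channel concatenation followed by an added 1×1 projection back to the original width; none of these choices are taken from the embodiments.

```python
# Illustrative adapter sketch; channel sizes, the expansion ratio and the final
# projection after splicing are assumptions. Input x is assumed to be a
# (B, C, H, W) feature map (tokens reshaped to a spatial grid).
import torch
from torch import nn

class AdapterModule(nn.Module):
    def __init__(self, channels=768, expand=2):
        super().__init__()
        hidden = channels * expand
        self.linear1 = nn.Conv2d(channels, channels, 1)            # first linear layer, 1x1 conv
        self.compress = nn.Conv2d(channels, hidden, 3, padding=1)  # compression linear layer, 3x3 conv
        self.act = nn.ReLU(inplace=True)                           # activation layer
        self.restore = nn.Conv2d(hidden, channels, 3, padding=1)   # restoration linear layer, 3x3 conv
        self.linear2 = nn.Conv2d(channels, channels, 1)            # second linear layer, 1x1 conv
        self.proj = nn.Conv2d(2 * channels, channels, 1)           # assumed projection after splicing

    def forward(self, x):
        y = self.linear1(x)
        # first activation function -> per-channel weights of the first linear layer's output
        gate = torch.sigmoid(y.mean(dim=(2, 3), keepdim=True))
        z = self.restore(self.act(self.compress(y)))
        z = self.linear2(gate * z)          # weighted features enter the second linear layer
        # global average pooling of the adapter input, broadcast back to the feature map
        pooled = x.mean(dim=(2, 3), keepdim=True).expand_as(x)
        out = self.proj(torch.cat([pooled, z], dim=1))   # splice and project back to C channels
        return out
```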
In some embodiments, the processing of the cue words with the cue word encoder to obtain cue word encoding features includes: determining the type of the prompt word according to the image, wherein the prompt word comprises a coordinate point, a detection frame, a text and an initial mask; if the prompting word comprises the coordinate points, mapping the coordinate points to a preset dimension vector to obtain coding features of the coordinate points; if the prompt word comprises the detection frame, respectively representing the upper left corner and the lower right corner of the detection frame by vector features to obtain coding features of the detection frame; if the prompt word comprises the text, extracting text characteristics of the text by using a preset text encoder to obtain coding characteristics of the text; if the prompt word comprises the initial mask, the initial mask is subjected to convolution processing and channel transformation to obtain initial mask features, the initial mask features are multiplied by image features to obtain coding features of the initial mask, and the image features are obtained based on a preset image feature extraction model.
Specifically, in the processing of the input by the basic segmentation module, the processing of the prompt words by the prompt word encoder is also included, and the prompt words can be processed by the prompt word encoder of the pre-trained segmentation module, but the parameters and the structure of the prompt word encoder are not limited here.
Further, the types of prompt words may include coordinate points, a detection frame, text and an initial mask. It should be understood that the basic segmentation module may process one or several of these prompt word types, and the type of prompt word input may be determined according to the corresponding image content. As one example, when a certain object in the image to be segmented needs to be segmented, the position coordinate point of the object may be provided as the prompt word to help the model segment that object accurately. As another example, when the image needs to be segmented according to a textual description or a keyword, text may be used as the prompt word for target segmentation; for instance, given an image containing a "dog", the target may be segmented by inputting the text "dog".
In the processing of the prompt word encoder, when the prompt word type includes coordinate points, each coordinate point can be mapped to a 256-dimensional vector representing the coding feature of the point's position, and two additional dimensions can represent learnable coding features marking the point as foreground or background;
when the prompt word type includes a detection frame, two vector features can represent the coding features of the upper-left corner and the lower-right corner respectively, so that the coding features of the detection frame carry its position information;
when the prompt word type includes text, the text may be input to a text encoder (e.g., CLIP encoder) to obtain encoding features of the text;
when the prompt word type includes an initial mask, the input image may be scaled to 1/4 of its original size to serve as the initial mask, then convolved with 2×2 convolution kernels whose output channels are 4 and 6, after which the number of channels is increased to 256 with a 1×1 convolution kernel to obtain the initial mask features. The initial mask features may then be multiplied with the corresponding image features in the prompt word encoder to obtain the prompt word encoder's output, where the image features may be obtained by a feature extraction model in the prompt word encoder, which may be an existing pre-trained image extraction model (e.g., a Vision Transformer, ViT, model).
It should be appreciated that the hint terms may be obtained using image processing techniques based on a training set of images or images to be segmented, or may be manually annotated, which is not limited herein.
In the scheme provided by this embodiment, during processing by the basic segmentation module, the prompt word encoder can convert different types of prompt words into a common feature representation for processing and analysis in subsequent tasks. In this way, information such as coordinate points, detection frames, text and initial masks can be encoded into vector features with semantic and expressive capability, so that the input data can be better represented and understood.
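An illustrative sketch of such a prompt word encoder follows. The 256-dimensional embedding size follows the point encoding described above, while the individual layer sizes, the mask branch's intermediate channel counts and the assumption of an external CLIP-style text encoder are illustrative choices, not the recited implementation.

```python
# Illustrative prompt word encoder mapping the four prompt types to a shared
# 256-d embedding space; all layer sizes below are assumptions.
import torch
from torch import nn

class PromptWordEncoder(nn.Module):
    def __init__(self, embed_dim=256, text_dim=512):
        super().__init__()
        self.point_embed = nn.Linear(2, embed_dim)        # (x, y) -> 256-d position feature
        self.point_label = nn.Embedding(2, embed_dim)     # learnable foreground/background codes
        self.corner_embed = nn.Linear(2, embed_dim)       # shared by top-left / bottom-right corners
        self.text_proj = nn.Linear(text_dim, embed_dim)   # projects an external text encoder's output
        self.mask_down = nn.Sequential(                   # initial mask -> mask features (assumed channels)
            nn.Conv2d(1, 4, kernel_size=2, stride=2),
            nn.Conv2d(4, 16, kernel_size=2, stride=2),
            nn.Conv2d(16, embed_dim, kernel_size=1))

    def encode_point(self, xy, is_foreground):            # is_foreground: LongTensor of 0/1
        return self.point_embed(xy) + self.point_label(is_foreground)

    def encode_box(self, box):                            # box: (B, 4) = (x1, y1, x2, y2)
        tl = self.corner_embed(box[:, :2])
        br = self.corner_embed(box[:, 2:])
        return torch.stack([tl, br], dim=1)               # two corner features per box

    def encode_text(self, text_features):                 # output of a CLIP-style text encoder
        return self.text_proj(text_features)

    def encode_mask(self, initial_mask, image_features):  # dense prompt, multiplied with image features
        return self.mask_down(initial_mask) * image_features
```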
In some embodiments, the mask is processed by the full connection layer and a first loss function is calculated, and the base segmentation module is trained based on the first loss function, so as to obtain a trained base segmentation module, which includes: processing the mask through a first function and a second function to obtain a final mask of the image; and calculating a first loss function according to the final mask and the real mask, and training the basic segmentation module based on the first loss function to obtain a trained basic segmentation module.
Specifically, after the basic segmentation module processes the image and the prompt word to obtain a corresponding mask, the mask may be further processed by using the full-connection layer, and a first loss function is calculated according to a final mask obtained by the full-connection layer, and parameters of the basic segmentation module may be reversely updated by using the first loss function to obtain the trained basic segmentation module.
Further, after processing by the image encoder and the prompt word encoder, the image coding features and the prompt word coding features are obtained and input to the mask decoder for decoding, which yields the corresponding masks (i.e., binary images); these may include a mask corresponding to the image coding features and a mask corresponding to the prompt word coding features. The masks are then processed in the full-connection layer through a first function and a second function to obtain the final mask, which can be regarded as the fusion of the mask corresponding to the image coding features and the mask corresponding to the prompt word coding features. A final mask may contain multiple objects or only a single object, depending on the image content.
It should be noted that in the full-connection layer, the received masks may first be converted into a vector representation and then linearly combined through a first function and a second function to obtain the final mask, after which the first loss function is calculated. As an example, the masks may be linearly combined using a Focal Loss and a Dice Loss with a weight ratio of 15:1, and an IoU score, i.e., an intersection-over-union score, may be computed with a mean square error loss function (MSE Loss) serving as the first loss function. The MSE Loss is used as the gradient signal propagated backwards in each iteration to update the parameters of the basic segmentation module, so that the predicted IoU score moves closer to the actual IoU score, thereby optimizing the IoU prediction ability of the basic segmentation module during training. When calculating the IoU score, the actual value may be computed from the final mask and the real mask, where the real mask may be manually pre-annotated.
According to the scheme provided by the embodiment of the application, the image coding features and the prompt word coding features are mapped to the final mask, and the first loss function is utilized for training to improve the performance of the basic segmentation module, so that the trained basic segmentation module can better segment the image, and more accurate object distinction is realized.
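The sketch below shows one way the first-stage supervision could be assembled: a 15:1 weighted combination of a generic Focal Loss and Dice Loss on the final mask, plus an MSE loss that pushes the predicted IoU score toward the IoU of the thresholded prediction against the real mask. The focal and dice implementations are common stand-ins, not the ones used in the embodiments, and combining both terms into one first-stage loss is one possible reading of the description.

```python
# Illustrative first-stage loss: 15:1 Focal/Dice on the mask plus MSE on the IoU score.
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    pred = torch.sigmoid(pred)
    inter = (pred * target).sum(dim=(-2, -1))
    union = pred.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return 1 - (2 * inter + eps) / (union + eps)

def focal_loss(pred, target, alpha=0.25, gamma=2.0):
    bce = F.binary_cross_entropy_with_logits(pred, target, reduction="none")
    p_t = torch.exp(-bce)
    return (alpha * (1 - p_t) ** gamma * bce).mean(dim=(-2, -1))

def first_stage_loss(pred_mask, iou_pred, real_mask):
    # 15:1 weighting of Focal Loss and Dice Loss, per the description above.
    mask_loss = 15.0 * focal_loss(pred_mask, real_mask) + dice_loss(pred_mask, real_mask)
    # Actual IoU of the thresholded prediction against the real mask, used as the MSE target.
    bin_pred = (torch.sigmoid(pred_mask) > 0.5).float()
    inter = (bin_pred * real_mask).sum(dim=(-2, -1))
    union = ((bin_pred + real_mask) > 0).float().sum(dim=(-2, -1))
    actual_iou = inter / union.clamp(min=1e-6)
    iou_loss = F.mse_loss(iou_pred, actual_iou)
    return mask_loss.mean() + iou_loss
```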
Further, fig. 3 is a flow chart of another multi-mode-based zero-sample image segmentation model training method according to an embodiment of the present application, and a training process of the basic segmentation module is further described below with reference to fig. 3.
First, an image and a prompt word can be input into the basic segmentation module, and the image encoder is used to process the image to obtain the image coding features. The image encoder can be assembled from a plurality of Transformer modules, an adapter module can be configured in each Transformer module in advance, and the adapter module is used as the parameter updating module of the image encoder when training the image encoder.
The prompt word is processed by the prompt word encoder to obtain the prompt word coding features, where the prompt word type can include coordinate points, detection frames, text and an initial mask (Mask).
Then, the mask decoder can be used to decode the image coding features and the prompt word coding features to obtain the corresponding masks, and the masks are processed through the full-connection layer to obtain the segmentation output, namely the final mask, and the first loss function. The basic segmentation module can be iteratively updated by back-propagating the first loss function to obtain the trained basic segmentation module.
In some embodiments, inputting an image into a segmentation fine adjustment module, performing cross-modal processing and saliency detection processing on the image respectively to obtain a text embedded feature and a salient feature map, and performing cross-attention fusion on the text embedded feature and the salient feature map to obtain a first fused feature map, including: performing cross-modal processing on the image by using the language model and the text encoder to obtain text embedded features; extracting the salient features of the image by using a salient detection module to obtain a salient feature map; and carrying out cross attention fusion on the text embedded feature and the salient feature map to obtain a first fusion feature map.
Specifically, the image training set can be utilized to train the segmentation fine adjustment module, in the training process of the segmentation fine adjustment module, the image can be input into the segmentation fine adjustment module, the image is subjected to cross-modal processing and saliency detection processing to obtain text embedded features and a salient feature map, and then the text embedded features and the salient feature map are subjected to cross-attention fusion to obtain a first fusion feature map.
Further, the language model can first be used to extract an attribute description of the image, and the attribute description is then input into the text encoder for encoding to obtain the text embedded features; meanwhile, the image passes through the saliency detection module to obtain its salient feature map. The text embedded features are then used as the query (Q) and the salient feature map as the key (K) and value (V) for cross-attention fusion, yielding the first fusion feature map; the specific cross-attention fusion mechanism is well known in the art and is not elaborated here.
In this way, through cross-modal processing and salient feature extraction of the image and the fusion of the text embedded features with the salient feature map, the textual information and the salient feature information of the image can be integrated for subsequent image segmentation tasks, improving the accuracy and semantic understanding of the segmentation results.
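A minimal sketch of this cross-attention fusion, with the text embedded features as the query and the flattened salient feature map as keys and values, is given below; the feature dimension and head count are assumptions.

```python
# Illustrative cross-attention fusion: text embedding as Q, saliency map as K and V.
import torch
from torch import nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_embed, saliency_map):
        # text_embed: (B, T, dim); saliency_map: (B, dim, H, W)
        kv = saliency_map.flatten(2).transpose(1, 2)      # (B, H*W, dim) keys/values
        fused, _ = self.attn(query=text_embed, key=kv, value=kv)
        return fused                                      # first fusion feature map
```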
Fig. 4 is a flowchart of another method for training a zero-sample image segmentation model based on multiple modes according to an embodiment of the present application, and the training process of the segmentation fine adjustment module is further described below with reference to fig. 4.
First, an image can be input into the segmentation fine adjustment module for cross-modal processing and saliency detection respectively: the image is processed by a language model (LM) and a text encoder (such as a CLIP text encoder) to obtain the text embedded features, and by the saliency detection module to obtain the salient feature map.
Then, the text embedded features and the salient feature map undergo cross-attention fusion to obtain the first fusion feature map, which is processed by a multi-layer perceptron (Multilayer Perceptron, MLP); the result is input into the full-connection layer, and a sigmoid cross-entropy loss function can be used to predict whether the textual description obtained by the LM matches the image, so as to update the segmentation fine adjustment module by back-propagation. The trained segmentation fine adjustment module is obtained through iterative training.
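The head and matching loss of the second training stage could look like the sketch below, assuming the first fusion feature map is pooled over its tokens before the full-connection layer; the layer sizes and the pooling choice are illustrative, not recited.

```python
# Illustrative second-stage head: MLP + full-connection layer + sigmoid cross-entropy
# loss predicting whether the LM description matches the image.
import torch
from torch import nn
import torch.nn.functional as F

class FineAdjustHead(nn.Module):
    def __init__(self, dim=256, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.fc = nn.Linear(dim, 1)                  # full-connection layer -> match logit

    def forward(self, fused):                        # fused: (B, T, dim) first fusion feature map
        pooled = self.mlp(fused).mean(dim=1)         # token pooling is an assumption
        return self.fc(pooled).squeeze(-1)

def second_stage_loss(match_logit, match_label):
    # match_label is 1 when the LM description matches the image, else 0.
    return F.binary_cross_entropy_with_logits(match_logit, match_label.float())
```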
In some embodiments, inputting the image and the prompt word into the zero sample image segmentation model for processing, calculating a third loss function, and training the zero sample image segmentation model based on the third loss function to obtain a trained zero sample image segmentation model, including: in a zero sample image segmentation model, processing the image and the prompt word by using a trained basic segmentation module to obtain a first output result; processing the image by using the trained segmentation fine adjustment module to obtain a second output result; cross attention fusion is carried out on the first output result and the second output result, and a second fusion feature diagram is obtained; inputting the second fusion feature map to a preset decoder for processing to obtain a segmentation result, and calculating a third loss function based on the segmentation result and the real mask; and updating the zero sample image segmentation model parameters according to the third loss function to obtain a trained zero sample image segmentation model.
Specifically, the zero sample image segmentation model can be obtained by splicing a trained basic segmentation module and a trained segmentation fine adjustment module, and the zero sample image segmentation model can be trained by utilizing an image training set and a prompt word training set before the zero sample image segmentation model is applied.
Specifically, the images and the prompt words in the training sets may be input into the basic segmentation module of the zero sample image segmentation model; this module has already been trained and can generate a preliminary segmentation result, i.e., the first output result, from the image and the prompt word. Meanwhile, the image can be input into the segmentation fine adjustment module of the zero sample image segmentation model for further processing to obtain the image-text fusion features, i.e., the second output result. The first output result and the second output result are fused with a cross-attention mechanism to obtain the second fusion feature map; the cross-attention mechanism can adaptively adjust the degree of attention according to the attention weights between the first output result and the second output result, so that the two are organically combined.
Then, a preset decoder (for example, a Transformer decoder) is used to process the second fusion feature map to obtain a segmentation result, and the difference between the segmentation result and the real mask, i.e., the third loss function, is calculated. The parameters of the zero sample image segmentation model are updated according to the third loss function, and the model's performance is continuously optimized through iterative training, so that the trained zero sample image segmentation model is obtained after multiple rounds of training.
It should be appreciated that when generating the segmentation result, the masks may be linearly combined with the focal loss and the Dice loss at a weight ratio of 1:1, and the third loss function may be obtained by computing the intersection-over-union score; these technical means are consistent with the full-connection layer processing in the training of the basic segmentation module and are not repeated here.
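Putting the pieces together, the third-stage forward pass and loss could be sketched as follows; the constituent modules are assumed to be supplied already trained (for example, along the lines of the sketches above), and only the 1:1 focal/dice weighting is taken from the description.

```python
# Illustrative assembly of the zero sample image segmentation model; the base module,
# fine adjustment module, fusion module and Transformer decoder are caller-supplied.
import torch
from torch import nn

class ZeroSampleSegmentationModel(nn.Module):
    def __init__(self, base_module, finetune_module, fusion, decoder):
        super().__init__()
        self.base = base_module            # trained basic segmentation module
        self.finetune = finetune_module    # trained segmentation fine adjustment module
        self.fusion = fusion               # cross attention fusion module
        self.decoder = decoder             # preset (Transformer) decoder

    def forward(self, image, prompt):
        first_out = self.base(image, prompt)        # first output result (preliminary masks)
        second_out = self.finetune(image)           # second output result (image-text fusion)
        fused = self.fusion(first_out, second_out)  # second fusion feature map
        pred_mask, iou_pred = self.decoder(fused)
        return pred_mask, iou_pred

def third_stage_loss(pred_mask, real_mask, focal_loss, dice_loss):
    # 1:1 weighting of focal and dice losses, per the description above.
    return (focal_loss(pred_mask, real_mask) + dice_loss(pred_mask, real_mask)).mean()
```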
Fig. 5 is a flow chart of another method for training a zero-sample image segmentation model based on multiple modes according to an embodiment of the present application, and the processing procedure of the trained zero-sample image segmentation model is further described below with reference to fig. 5.
In the trained zero sample image segmentation model, the image to be segmented and the corresponding prompt word are first input into the basic segmentation module. The image encoder configured with the adapter module encodes the image to be segmented to obtain the image coding features, which are input into the mask decoder for decoding to obtain the mask corresponding to the image coding features. Meanwhile, the prompt word encoder encodes the prompt word to obtain the prompt word coding features; the prompt word can include a coordinate point, a detection frame, text and an initial mask. The prompt word coding features are then processed by the mask decoder to obtain the mask corresponding to the prompt word coding features, and the masks processed by the basic segmentation module are input into the cross attention fusion module for cross-attention fusion with the first fusion feature map produced by the segmentation fine adjustment module.
In the segmentation fine adjustment module, the received image can undergo cross-modal processing and saliency detection processing respectively to obtain the text embedded features and the salient feature map, where the cross-modal processing can use a language model and a CLIP text encoder to obtain the text embedded features. The text embedded features and the salient feature map then undergo cross-attention fusion to obtain the first fusion feature map, i.e., the image-text fusion features. It should be understood that, within the same segmentation pass, the image received by the segmentation fine adjustment module is identical to the image received by the basic segmentation module.
Further, the first fusion feature map and the masks are fused again with the cross-attention mechanism to obtain the second fusion feature map, which is then processed by the Transformer decoder to obtain the output of the zero sample image segmentation model. The output can include the final mask of the image segmentation and an IoU score, so the performance of the zero sample image segmentation model can be evaluated through the IoU score; that is, during application of the model, its parameters can be updated based on the IoU score, which further helps the model improve the accuracy of image segmentation and its understanding ability, thereby helping the model accurately segment unseen object types and small objects.
It should be appreciated that the attention mechanisms mentioned in the embodiments of the present application may all be the same attention mechanism, i.e., the cross-attention fusion process in the segmentation fine adjustment module may be consistent with the other cross-attention fusion process in the zero sample image segmentation model.
In this way, the zero sample image segmentation model is trained to obtain the trained zero sample image segmentation model, so that the model can perform multi-modal learning through prompt word coding and image coding even without annotated samples of the target categories, which improves the model's ability to learn unseen object types and small objects and thus the performance of the zero sample image segmentation model.
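Finally, a hypothetical usage snippet is shown below for segmenting an image that contains an unseen object class with a text prompt word; it assumes the module names, prompt format and checkpoint file from the earlier sketches and is not part of the embodiments.

```python
# Hypothetical inference with the trained model; all names and the prompt format
# come from the illustrative sketches above, not from the embodiments.
import torch

model = ZeroSampleSegmentationModel(base_module, finetune_module, fusion, decoder)
model.load_state_dict(torch.load("zero_sample_seg.pt"))   # assumed checkpoint name
model.eval()

with torch.no_grad():
    image = torch.rand(1, 3, 1024, 1024)            # image to be segmented (placeholder)
    prompt = {"text": "dog"}                         # prompt word determined from the image content
    pred_mask, iou_score = model(image, prompt)
    final_mask = torch.sigmoid(pred_mask) > 0.5      # binary segmentation mask
```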
Any combination of the above optional solutions may be adopted to form an optional embodiment of the present application, which is not described herein in detail.
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
Fig. 6 is a schematic diagram of a multi-mode-based zero-sample image segmentation model training device according to an embodiment of the present application. As shown in fig. 6, the multi-mode-based zero-sample image segmentation model training device includes:
An acquisition module 601 configured to acquire images in the image training set and prompt words in the prompt word training set, the prompt words being determined based on content of the images;
the first processing module 602 is configured to input the image and the prompt word into the basic segmentation module, process the image by using the image encoder to obtain image coding features, and process the prompt word by using the prompt word encoder to obtain prompt word coding features;
the first training module 603 is configured to process the image coding feature and the prompt word coding feature through a mask decoder to obtain a corresponding mask, process the mask through a full connection layer, calculate a first loss function, and train the basic segmentation module based on the first loss function to obtain a trained basic segmentation module;
the second processing module 604 is configured to input the image into the segmentation fine adjustment module, perform cross-modal processing and saliency detection processing on the image respectively to obtain a text embedded feature and a salient feature map, and perform cross-attention fusion on the text embedded feature and the salient feature map to obtain a first fused feature map;
the second training module 605 is configured to process the first fused feature map through the multi-layer perceptron and the full-connection layer, calculate a second loss function, and train the segmentation fine adjustment module based on the second loss function to obtain a trained segmentation fine adjustment module;
A third processing module 606, configured to splice the trained basic segmentation module and the trained segmentation fine adjustment module, and connect the cross attention fusion module and the preset decoder to obtain a zero sample image segmentation model;
the third training module 607 is configured to input the image and the prompt word to the zero sample image segmentation model for processing, calculate a third loss function, and train the zero sample image segmentation model based on the third loss function to obtain a trained zero sample image segmentation model.
In some embodiments, the first processing module 602 is specifically configured to process the image with a plurality of Transformer modules to obtain the image coding features, where each Transformer module includes a first normalization layer, a multi-head attention layer, an adapter module, a second normalization layer and a multi-layer perceptron; the output of the multi-head attention layer is processed by the adapter module to obtain the input of the second normalization layer, and the adapter module is configured as the parameter updating module of the image encoder.
In some embodiments, the first processing module 602 is specifically configured to, in the adapter module, process the output of the multi-head attention layer sequentially with the first linear layer, the compression linear layer, the activation layer, the restoration linear layer and the second linear layer to obtain the input of the second normalization layer; multiply the output of the first linear layer, after processing by the first activation function, with the output of the restoration linear layer to obtain the input of the second linear layer; and splice the result of global average pooling of the adapter module's input with the output of the second linear layer to obtain the output of the adapter module, which is taken as the input of the second normalization layer.
In some embodiments, the first processing module 602 is specifically configured to determine, according to the image, a type of a hint word, where the hint word includes a coordinate point, a detection frame, text, and an initial mask; if the prompting word comprises a coordinate point, mapping the coordinate point to a preset dimension vector to obtain the coding characteristic of the coordinate point; if the prompt word comprises a detection frame, respectively representing an upper left corner and a lower right corner of the detection frame by vector features to obtain coding features of the detection frame; if the prompt word comprises the text, extracting text characteristics of the text by using a preset text encoder to obtain coding characteristics of the text; if the prompt word comprises an initial mask, the initial mask is subjected to convolution processing and channel transformation to obtain initial mask features, the initial mask features are multiplied by image features to obtain coding features of the initial mask, and the image features are obtained based on a preset image feature extraction model.
In some embodiments, the first training module 603 is specifically configured to process the mask through a first function and a second function to obtain a final mask of the image; and calculating a first loss function according to the final mask and the real mask, and training the basic segmentation module based on the first loss function to obtain a trained basic segmentation module.
In some embodiments, the second processing module 604 is specifically configured to perform cross-modal processing on the image using the language model and the text encoder to obtain a text embedding feature; extracting the salient features of the image by using a salient detection module to obtain a salient feature map; and carrying out cross attention fusion on the text embedded feature and the salient feature map to obtain a first fusion feature map.
In some embodiments, the third training module 607 is specifically configured to process, in the zero-sample image segmentation model, the image and the prompt word by using the trained basic segmentation module, to obtain a first output result; processing the image by using the trained segmentation fine adjustment module to obtain a second output result; cross attention fusion is carried out on the first output result and the second output result, and a second fusion feature diagram is obtained; inputting the second fusion feature map to a preset decoder for processing to obtain a segmentation result, and calculating a third loss function based on the segmentation result and the real mask; and updating the zero sample image segmentation model parameters according to the third loss function to obtain a trained zero sample image segmentation model.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic and should not constitute any limitation on the implementation of the embodiments of the present application.
Fig. 7 is a schematic diagram of an electronic device 7 provided in an embodiment of the present application. As shown in fig. 7, the electronic device 7 of this embodiment includes: a processor 701, a memory 702 and a computer program 703 stored in the memory 702 and executable on the processor 701. The steps of the various method embodiments described above are implemented by the processor 701 when executing the computer program 703. Alternatively, the processor 701, when executing the computer program 703, performs the functions of the modules/units of the apparatus embodiments described above.
The electronic device 7 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 7 may include, but is not limited to, a processor 701 and a memory 702. It will be appreciated by those skilled in the art that fig. 7 is merely an example of the electronic device 7 and does not limit it; the electronic device 7 may include more or fewer components than shown, or different components.
The processor 701 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
The memory 702 may be an internal storage unit of the electronic device 7, for example, a hard disk or a memory of the electronic device 7. The memory 702 may also be an external storage device of the electronic device 7, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like provided on the electronic device 7. The memory 702 may also include both internal storage units and external storage devices of the electronic device 7. The memory 702 is used to store computer programs and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application implements all or part of the flow in the methods of the above embodiments, or may be implemented by a computer program to instruct related hardware, and the computer program may be stored in a computer readable storage medium, where the computer program may implement the steps of the respective method embodiments described above when executed by a processor. The computer program may comprise computer program code, which may be in source code form, object code form, executable file or in some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A zero sample image segmentation model training method based on multiple modes is characterized by comprising the following steps:
acquiring images in an image training set and prompt words in a prompt word training set, wherein the prompt words are determined based on the content of the images;
inputting the image and the prompt word into a basic segmentation module, processing the image by using an image encoder to obtain image coding features, and processing the prompt word by using a prompt word encoder to obtain prompt word coding features;
processing the image coding features and the prompt word coding features through a mask decoder to obtain corresponding masks, processing the masks through a full connection layer, calculating a first loss function, and training the basic segmentation module based on the first loss function to obtain a trained basic segmentation module;
inputting the image into a segmentation fine adjustment module, performing cross-modal processing and saliency detection processing on the image respectively to obtain text embedded features and a salient feature map, and performing cross-attention fusion on the text embedded features and the salient feature map to obtain a first fusion feature map;
processing the first fusion feature map through a multi-layer perceptron and a full-connection layer, calculating a second loss function, and training the segmentation fine adjustment module based on the second loss function to obtain a trained segmentation fine adjustment module;
splicing the trained basic segmentation module and the trained segmentation fine adjustment module, and connecting a cross attention fusion module and a preset decoder to obtain a zero sample image segmentation model;
and inputting the images and the prompt words into the zero sample image segmentation model for processing, calculating a third loss function, and training the zero sample image segmentation model based on the third loss function to obtain a trained zero sample image segmentation model.
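For illustration only and not as part of the claims, the staged training recited in claim 1 can be sketched as a small PyTorch helper. The helper name run_stage, the AdamW optimizer, and the commented splice step are assumptions introduced for the sketch; the concrete modules and loss functions are those described in claims 2 to 7.

```python
# Minimal sketch of the staged training in claim 1 (PyTorch). run_stage and AdamW are
# assumptions; modules, batches and loss functions correspond to claims 2-7 and are
# supplied by the caller.
import torch

def run_stage(module, batches, loss_fn, lr=1e-4):
    """Train one stage: the basic segmentation module (first loss), the segmentation
    fine adjustment module (second loss), or the spliced model (third loss)."""
    opt = torch.optim.AdamW([p for p in module.parameters() if p.requires_grad], lr=lr)
    module.train()
    for inputs, target in batches:            # e.g. ((image, prompt_word), real_mask)
        pred = module(*inputs)
        loss = loss_fn(pred, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return module

# Assumed usage, in the order of claim 1:
#   base     = run_stage(base_module, base_batches, first_loss)        # trained basic segmentation module
#   finetune = run_stage(finetune_module, image_batches, second_loss)  # trained segmentation fine adjustment module
#   model    = splice(base, finetune)   # + cross attention fusion module + preset decoder
#   model    = run_stage(model, full_batches, third_loss)              # trained zero sample image segmentation model
```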
2. The method of claim 1, wherein the processing the image with an image encoder to obtain image coding features comprises:
processing the image by using a plurality of converter modules to obtain the image coding characteristics, wherein the converter modules comprise a first normalization layer, a multi-head attention layer, an adapter module, a second normalization layer and a multi-layer perceptron;
and processing the output of the multi-head attention layer according to the adapter module to obtain the input of the second normalization layer, wherein the adapter module is configured as a parameter updating module of the image encoder.
3. The method of claim 2, wherein said processing the output of the multi-headed attention layer according to the adapter module to obtain the input of the second normalization layer comprises:
in the adapter module, the output of the multi-head attention layer is sequentially processed by utilizing a first linear layer, a compression linear layer, an activation layer, a restoration linear layer and a second linear layer to obtain the input of the second normalization layer;
and the output of the adapter module is obtained by splicing the result of global average pooling of the input of the adapter module with the output of the second linear layer, the output of the adapter module being used as the input of the second normalization layer.
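For illustration only, a minimal PyTorch sketch of the converter module and its adapter as described in claims 2 and 3 is given below. The embedding width, the GELU activation, the residual connections, and the extra projection used so that the spliced (concatenated) output keeps the original channel width are assumptions; the claims fix only the order of the layers.

```python
# Sketch of the adapter-augmented converter (transformer) module of claims 2-3 (PyTorch).
# Layer widths, GELU, the residuals and pool_proj are assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.first_linear = nn.Linear(dim, dim)        # first linear layer
        self.compress = nn.Linear(dim, bottleneck)     # compression linear layer
        self.act = nn.GELU()                           # activation layer (assumed GELU)
        self.restore = nn.Linear(bottleneck, dim)      # restoration linear layer
        self.second_linear = nn.Linear(dim, dim // 2)  # second linear layer (half width, see below)
        self.pool_proj = nn.Linear(dim, dim // 2)      # assumed projection of the pooled input

    def forward(self, x):                              # x: (B, N, dim) = multi-head attention output
        y = self.second_linear(self.restore(self.act(self.compress(self.first_linear(x)))))
        # global average pooling of the adapter input, spliced (concatenated) with the
        # second linear layer output; halving both widths keeps the block dimension at dim
        pooled = self.pool_proj(x.mean(dim=1, keepdim=True)).expand_as(y)
        return torch.cat([pooled, y], dim=-1)

class ConverterModule(nn.Module):
    def __init__(self, dim=768, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                 # first normalization layer
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # multi-head attention layer
        self.adapter = Adapter(dim)                    # parameter updating module of the image encoder
        self.norm2 = nn.LayerNorm(dim)                 # second normalization layer
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))  # multi-layer perceptron

    def forward(self, x):
        q = self.norm1(x)
        h, _ = self.attn(q, q, q)
        h = x + self.adapter(h)                        # adapter output feeds the second normalization layer
        return h + self.mlp(self.norm2(h))
```

Freezing every parameter of the image encoder except those of the adapter would then realize the "parameter updating module" reading of claim 2.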
4. The method of claim 1, wherein the processing the hint word with a hint word encoder to obtain a hint word encoding feature comprises:
determining the type of the prompt word according to the image, wherein the prompt word comprises a coordinate point, a detection box, a text and an initial mask;
if the prompt word comprises the coordinate point, mapping the coordinate point to a vector of a preset dimension to obtain the coding feature of the coordinate point;
if the prompt word comprises the detection box, representing an upper left corner and a lower right corner of the detection box respectively by vector features to obtain the coding features of the detection box;
if the prompt word comprises the text, extracting text features of the text by using a preset text encoder to obtain the coding features of the text;
if the prompt word comprises the initial mask, carrying out convolution processing and channel transformation on the initial mask to obtain initial mask features, and multiplying the initial mask features by image features to obtain the coding features of the initial mask, wherein the image features are obtained based on a preset image feature extraction model.
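For illustration only, the four prompt-word branches of claim 4 can be sketched as below. The embedding sizes, the EmbeddingBag stand-in for the preset text encoder, and the convolution stack used for the initial mask are assumptions.

```python
# Sketch of the prompt word encoder branches of claim 4 (PyTorch).
# Dimensions and the text-encoder stand-in are assumptions.
import torch
import torch.nn as nn

class PromptWordEncoder(nn.Module):
    def __init__(self, dim=256, vocab_size=30522):
        super().__init__()
        self.point_embed = nn.Linear(2, dim)             # map (x, y) to a vector of preset dimension
        self.corner_embed = nn.Linear(2, dim)            # shared embedding for the two box corners
        self.text_encoder = nn.EmbeddingBag(vocab_size, dim)  # stand-in for the preset text encoder
        self.mask_conv = nn.Sequential(                  # convolution processing + channel transformation
            nn.Conv2d(1, dim // 4, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim // 4, dim, 1))

    def encode_point(self, xy):                          # xy: (B, 2) coordinate point
        return self.point_embed(xy)

    def encode_box(self, box):                           # box: (B, 4) = (x1, y1, x2, y2)
        top_left = self.corner_embed(box[:, :2])         # upper left corner feature
        bottom_right = self.corner_embed(box[:, 2:])     # lower right corner feature
        return torch.stack([top_left, bottom_right], dim=1)

    def encode_text(self, token_ids):                    # token_ids: (B, L) integer tokens
        return self.text_encoder(token_ids)

    def encode_mask(self, initial_mask, image_feat):     # initial_mask: (B, 1, H, W)
        mask_feat = self.mask_conv(initial_mask)         # initial mask features
        return mask_feat * image_feat                    # multiply by image features (same shape assumed)
```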
5. The method of claim 1, wherein the processing the mask through a full connection layer, calculating a first loss function, and training the basic segmentation module based on the first loss function to obtain a trained basic segmentation module comprises:
processing the mask through a first function and a second function to obtain a final mask of the image;
and calculating the first loss function according to the final mask and the real mask, and training the basic segmentation module based on the first loss function to obtain the trained basic segmentation module.
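For illustration only, one possible reading of claim 5 is sketched below. The claim does not name the first and second functions or the form of the first loss function; the sketch assumes a sigmoid, bilinear upsampling to the label resolution, and a BCE plus Dice combination.

```python
# Sketch of claim 5 under the stated assumptions (PyTorch).
import torch
import torch.nn.functional as F

def final_mask(mask_logits, out_size):
    prob = torch.sigmoid(mask_logits)                             # assumed "first function"
    return F.interpolate(prob, size=out_size, mode="bilinear",    # assumed "second function"
                         align_corners=False)

def first_loss(final, real_mask, eps=1e-6):
    bce = F.binary_cross_entropy(final, real_mask)
    inter = (final * real_mask).sum(dim=(2, 3))
    dice = 1 - (2 * inter + eps) / (final.sum(dim=(2, 3)) + real_mask.sum(dim=(2, 3)) + eps)
    return bce + dice.mean()
```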
6. The method of claim 1, wherein inputting the image into a segmentation fine adjustment module, performing cross-modal processing and saliency detection processing on the image to obtain a text embedded feature and a salient feature map, and performing cross-attention fusion on the text embedded feature and the salient feature map to obtain a first fused feature map, comprises:
performing cross-modal processing on the image by using a language model and a text encoder to obtain the text embedded feature;
extracting salient features of the image by using a saliency detection module to obtain the salient feature map;
and carrying out cross attention fusion on the text embedded feature and the salient feature map to obtain the first fusion feature map.
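For illustration only, the cross attention fusion of claim 6 can be sketched as follows. Treating the text embedded feature as the query and the flattened salient feature map as key and value, with a residual connection and layer normalization, is an assumption; the claim states only that the two are fused by cross attention.

```python
# Sketch of the cross attention fusion in claim 6 (PyTorch); the wiring is an assumption.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_embed, salient_map):          # text_embed: (B, T, C); salient_map: (B, C, H, W)
        b, c, h, w = salient_map.shape
        kv = salient_map.flatten(2).transpose(1, 2)       # (B, H*W, C): salient positions as tokens
        fused, _ = self.attn(text_embed, kv, kv)          # text queries attend over salient regions
        return self.norm(text_embed + fused)              # first fusion feature map (token form)
```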
7. The method of claim 1, wherein the inputting the image and the hint word into the zero sample image segmentation model for processing, and calculating a third loss function, training the zero sample image segmentation model based on the third loss function, and obtaining a trained zero sample image segmentation model comprises:
in the zero sample image segmentation model, processing the image and the prompt word by utilizing the trained basic segmentation module to obtain a first output result;
processing the image by using the trained segmentation fine adjustment module to obtain a second output result;
performing cross attention fusion on the first output result and the second output result to obtain a second fusion feature map;
inputting the second fusion feature map to the preset decoder for processing to obtain a segmentation result, and calculating the third loss function based on the segmentation result and a real mask;
and updating the zero sample image segmentation model parameters according to the third loss function to obtain a trained zero sample image segmentation model.
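For illustration only, the assembled model of claim 7 can be sketched as follows. The stand-in preset decoder, the assumption that the cross attention fusion module returns a spatial feature map, and the BCE-based third loss are placeholders, not the claimed design.

```python
# Sketch of the spliced zero sample image segmentation model of claim 7 (PyTorch).
# base_module, finetune_module and fusion_module are the trained components; the
# decoder and the loss below are assumptions.
import torch.nn as nn
import torch.nn.functional as F

class ZeroSampleSegModel(nn.Module):
    def __init__(self, base_module, finetune_module, fusion_module, dim=256):
        super().__init__()
        self.base = base_module            # trained basic segmentation module
        self.finetune = finetune_module    # trained segmentation fine adjustment module
        self.fusion = fusion_module        # cross attention fusion module
        self.decoder = nn.Sequential(      # stand-in for the preset decoder
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(), nn.Conv2d(dim, 1, 1))

    def forward(self, image, prompt_word):
        first_out = self.base(image, prompt_word)       # first output result
        second_out = self.finetune(image)               # second output result
        fused = self.fusion(first_out, second_out)      # second fusion feature map, assumed (B, dim, H, W)
        return self.decoder(fused)                      # segmentation logits

def third_loss(model, image, prompt_word, real_mask):
    logits = model(image, prompt_word)
    logits = F.interpolate(logits, size=real_mask.shape[-2:], mode="bilinear", align_corners=False)
    return F.binary_cross_entropy_with_logits(logits, real_mask)
```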
8. A zero sample image segmentation model training device based on multiple modes, comprising:
the acquisition module is configured to acquire images in an image training set and prompt words in a prompt word training set, wherein the prompt words are determined based on the content of the images;
the first processing module is configured to input the image and the prompt word into the basic segmentation module, process the image by using an image encoder to obtain image coding characteristics, and process the prompt word by using a prompt word encoder to obtain prompt word coding characteristics;
the first training module is configured to process the image coding features and the prompt word coding features through a mask decoder to obtain corresponding masks, process the masks through a full-connection layer, calculate a first loss function, train the basic segmentation module based on the first loss function, and obtain a trained basic segmentation module;
the second processing module is configured to input the image into the segmentation fine adjustment module, perform cross-modal processing and saliency detection processing on the image respectively to obtain text embedded features and a salient feature map, and perform cross-attention fusion on the text embedded features and the salient feature map to obtain a first fusion feature map;
the second training module is configured to process the first fusion feature map through a multi-layer perceptron and a full-connection layer, calculate a second loss function, train the segmentation fine adjustment module based on the second loss function, and obtain a trained segmentation fine adjustment module;
the third processing module is configured to splice the trained basic segmentation module and the trained segmentation fine adjustment module, and connect the cross attention fusion module and a preset decoder to obtain a zero sample image segmentation model;
and the third training module is configured to input the images and the prompt words into the zero sample image segmentation model for processing, calculate a third loss function, train the zero sample image segmentation model based on the third loss function, and obtain a trained zero sample image segmentation model.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
CN202410161421.6A 2024-02-02 2024-02-02 Zero sample image segmentation model training method and device based on multiple modes Pending CN117788981A (en)

Priority Applications (1)

Application Number: CN202410161421.6A
Priority Date: 2024-02-02
Filing Date: 2024-02-02
Title: Zero sample image segmentation model training method and device based on multiple modes

Applications Claiming Priority (1)

Application Number: CN202410161421.6A
Priority Date: 2024-02-02
Filing Date: 2024-02-02
Title: Zero sample image segmentation model training method and device based on multiple modes

Publications (1)

Publication Number Publication Date
CN117788981A (en) 2024-03-29

Family

ID=90398668

Family Applications (1)

Application Number: CN202410161421.6A
Title: Zero sample image segmentation model training method and device based on multiple modes
Publication: CN117788981A (en), Pending
Priority Date: 2024-02-02
Filing Date: 2024-02-02

Country Status (1)

Country Link
CN (1) CN117788981A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination