CN115359254A - Vision Transformer network-based weakly supervised instance segmentation method, system and medium - Google Patents

Vision Transformer network-based weakly supervised instance segmentation method, system and medium

Info

Publication number
CN115359254A
Authority
CN
China
Prior art keywords
candidate region
cob
natural image
vit
weak supervision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210877230.0A
Other languages
Chinese (zh)
Inventor
余晋刚 (Yu Jingang)
梁宇琦 (Liang Yuqi)
吴仕科 (Wu Shike)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210877230.0A
Publication of CN115359254A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a weakly supervised instance segmentation method, system and medium based on a Vision Transformer network. The method comprises the following steps: acquiring a labeled natural image dataset and a natural image to be segmented; constructing a weakly supervised instance segmentation model, the model comprising a ViT multi-label classification module and a ViT candidate region scoring module, where the ViT multi-label classification module comprises a Vision Transformer network and a candidate region pseudo label generator, and the ViT candidate region scoring module comprises a candidate region generator and a ViT candidate region feature generator; initializing the weakly supervised instance segmentation model, constructing a loss function, performing iterative training on the labeled natural image dataset, and optimizing the loss function to obtain a trained weakly supervised instance segmentation model; and inputting the natural image to be segmented into the trained weakly supervised instance segmentation model to obtain an instance segmentation result. The invention realizes instance segmentation of natural images while keeping high performance, accelerating inference and reducing the consumption of computing power.

Description

Vision Transformer network-based weakly supervised instance segmentation method, system and medium
Technical Field
The invention belongs to the technical field of weakly supervised instance segmentation, and in particular relates to a weakly supervised instance segmentation method, system and medium based on a Vision Transformer network.
Background
Instance segmentation is one of the key problems in image understanding and computer vision: the task is to predict the class of every pixel in an image and to assign different instance labels to individual instances belonging to the same class. In medical image analysis, instance segmentation enables extremely accurate understanding of the data and greatly improves diagnostic efficiency and accuracy; in robotics and autonomous driving, instance segmentation provides pixel-level scene understanding and improves recognition efficiency and precision. Since high-performance instance segmentation requires fine pixel-level annotation, researchers are increasingly interested in how to train models with only image-level class labels so as to approach the instance segmentation performance of the fully supervised setting and save time and labeling cost. Weakly supervised instance segmentation is a new research direction whose core is to locate each instance and find explicit instance boundaries, which raises a series of challenges: first, with only image-level class labels, a trained multi-label classification network classifies according to the most discriminative regional features of the objects of each class in the dataset, so the class activation maps or saliency maps obtained from a convolutional neural network (CNN) usually focus only on an incomplete subset of instances and on parts of objects of the same class, and the localization information for different instances is defective; second, finding explicit instance boundaries is not easy, and a CNN cannot automatically delineate the boundaries between instances without pixel-level instance boundary annotations.
To address the above technical problems, three kinds of solutions exist in the prior art: the first generates candidate masks of the image with an advanced method, where the redundant candidate masks are likely to contain all instances in the picture and have relatively accurate boundaries; the second establishes a supervision signal for instance boundaries from the class activation map or saliency map generated by a CNN and from the change of pixel gray values, and trains an instance filling module; the third designs a new back-propagation scheme on top of the CNN architecture and obtains the contour information of each instance in the original image by back-propagating the instance's peak response point; examples include the PRM method proposed by Zhou et al. in "Weakly Supervised Instance Segmentation using Class Peak Response" and the WS-RCNN framework proposed by Ou et al. in "Learning to Score Proposals for Weakly Supervised Instance Segmentation". However, in the prior art, on the one hand, because of defects in the candidate mask scoring mechanism and because the class activation map generated by a CNN focuses only on the most salient region, a large number of non-salient instances are often lost, so the instance segmentation result attends only to the most discriminative parts and the effect is poor; on the other hand, to guarantee higher performance, large and deep state-of-the-art CNNs are used as the backbone of the instance segmentation model, and model training consumes a large amount of computing resources, so training and inference take a long time and efficiency is low; in particular, the prior art only addresses the instance segmentation task in the computer vision (CV) modality and cannot be directly fused with tasks in the natural language processing (NLP) modality to achieve higher-level tasks.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and to provide a Vision Transformer-based weakly supervised instance segmentation method, system and medium, in which a class activation map is generated by learning the global information of the image with a Vision Transformer network, and pseudo labels are constructed from the COB candidate regions generated by the convolution-oriented boundary algorithm combined with a hierarchical segmentation algorithm; finally, the classification scores of the COB candidate regions are predicted from the class activation map by the ViT candidate region feature generator to obtain the instance segmentation result; the method has few training parameters, short training time, and accurate and effective results.
In order to achieve the purpose, the invention adopts the following technical scheme:
in one aspect, the invention provides a Vision Transformer network-based weak supervision instance segmentation method, which comprises the following steps:
acquiring a natural image data set with a label and a natural image to be segmented;
constructing a weak supervision example segmentation model; the weak supervision instance segmentation model comprises a ViT multi-label classification module and a ViT candidate region scoring module; the ViT multi-label classification module comprises a Vision Transformer network and a candidate region pseudo label generator; the ViT candidate region scoring module comprises a candidate region generator and a ViT candidate region feature generator;
the Vision Transformer network is used for acquiring a multi-label classification result and generating a class activation graph; the candidate region pseudo label generator generates candidate region pseudo labels according to the category activation graph; the candidate region generator generates a COB candidate region by using a convolution-oriented boundary algorithm and a hierarchical segmentation algorithm; the ViT candidate region feature generator generates feature vectors of the COB candidate regions by adopting a SegAlign method, and the feature vectors are mapped into classification scores of the COB candidate regions through a full connection layer;
initializing the weakly supervised instance segmentation model, constructing a loss function, performing iterative training on the labeled natural image dataset, and optimizing the loss function to obtain a trained weakly supervised instance segmentation model;
and inputting the natural image to be segmented into the trained weakly supervised instance segmentation model to obtain an instance segmentation result.
In a preferred embodiment, the labeled natural image dataset is represented as:
$\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^{N}$
where $X_i$ denotes the i-th labeled natural image and $Y_i$ the label of the i-th natural image, $N$ denotes the number of images in the labeled natural image dataset, and $C$ denotes the number of labels;
before iterative training of the weakly supervised instance segmentation model on the labeled natural image dataset, the natural images in the dataset are randomly cropped to a set size, randomly horizontally flipped, and normalized per channel;
initializing the weakly supervised instance segmentation model means pre-training it on a large image dataset and using the pre-trained model parameters as the initialization parameters.
In a preferred embodiment, the loss functions comprise a Focal Loss function and a cross-entropy (CE) loss function;
the Focal Loss function is used to train the ViT multi-label classification module and is expressed as:
$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$
where $y$ is the true label, $\alpha_t$ and $\gamma$ are the balancing and focusing factors of the Focal Loss, and $p_t$ is the prediction probability, defined as:
$p_t = p$ if $y = 1$, and $p_t = 1 - p$ otherwise,
where $p$ is the output value of the Vision Transformer network without any activation function applied, obtained as follows:
a natural image of input size W × H is divided into w × h image blocks, each containing P × P pixels, where w = W/P and h = H/P; the image blocks are input into the Vision Transformer network to output a feature matrix, and the feature matrix is mapped by a convolutional layer and a global average pooling layer into a C-dimensional prediction score vector, which is the output value p of the Vision Transformer network;
the CE loss function is used to train the ViT candidate region scoring module and is expressed as:
$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\,\log(p'_{i,k})$
where $y_{i,k}$ is the true label of the i-th COB candidate region, there are K label values and N COB candidate regions, and $p'_{i,k}$ is the probability that the i-th COB candidate region is predicted as the k-th label value.
In a preferred embodiment, the iterative training on the labeled natural image dataset specifically comprises:
classifying the labeled natural image dataset with the Vision Transformer network to obtain a multi-label classification result and generate a class activation map;
inputting the labeled natural image dataset into the candidate region generator and generating COB candidate regions with the convolution-oriented boundary algorithm and a hierarchical segmentation algorithm;
obtaining candidate region pseudo labels with the candidate region pseudo label generator according to the class activation map and the COB candidate regions;
inputting the COB candidate regions into the ViT candidate region feature generator, generating the feature vectors of the COB candidate regions with the SegAlign method, and mapping the feature vectors through fully connected layers to the classification scores and classes of the COB candidate regions;
calculating the loss value, optimizing the loss function, and training iteratively until convergence to obtain the trained weakly supervised instance segmentation model.
In a preferred embodiment, the Vision Transformer network comprises a convolutional layer, L cascaded Transformer blocks and a global average pooling layer; each Transformer block comprises a linear transformation layer, a multi-head self-attention layer and a multi-layer perceptron (MLP) block;
the obtaining of the multi-label classification result and the generation of the class activation graph specifically include:
inputting a labeled natural image data set into a Vision Transformer network, cutting each natural image with the size of W multiplied by H in the labeled natural image data set into W multiplied by H image blocks, performing convolution operation through a convolution layer to obtain a one-dimensional vector, and obtaining N block labels t; adding category labels to block labels
Figure BDA0003762840710000041
D represents the dimension of each block label;
sending all block marks added with category marks into L cascaded transformer blocks for feature extraction to obtain a feature matrix S of the image c And L attention vectors
Figure BDA0003762840710000042
Feature matrix S of image c Inputting the convolutional layer and the global average pooling layer to obtain a multi-label classification result;
for L attention vectors
Figure BDA0003762840710000043
Calculating an average value, and deforming according to the position of the image block in the natural image to obtain an attention diagram, wherein the formula is as follows:
Figure BDA0003762840710000044
A′ * =Γ w×h (A * )
wherein, gamma is w×h (. Cndot.) is a deformation function;
the attention map and the feature matrix of the image are multiplied element by element to produce a class activation map TS-CAM, denoted as
Figure BDA0003762840710000045
The element multiplication formula is:
Figure BDA0003762840710000046
In an embodiment, obtaining the candidate region pseudo labels with the candidate region pseudo label generator specifically comprises:
in the candidate region pseudo label generator, obtaining the local peaks $P_i^c$ on the class activation map;
using each local peak $P_i^c$ and the COB candidate regions $\{R_n\}_{n=1}^{N}$ to obtain an auxiliary mask $M_i^c$;
sorting the auxiliary masks $M_i^c$ in ascending order of their local peak values $P_i^c$ and computing in turn the overlap $IOU(R_n, M_i^c)$ between a COB candidate region $R_n$ and each auxiliary mask $M_i^c$; if the overlap exceeds a threshold $\lambda$, the pseudo label $z_n$ of that COB candidate region is set to class c, i.e. $z_n = c$; the overlap IOU denotes the ratio of the intersection of the two regions to their union;
if the overlap between a COB candidate region and all auxiliary masks is below the threshold, the COB candidate region is labeled as the background class.
In a preferred embodiment, obtaining the local peaks $P_i^c$ on the class activation map specifically comprises:
taking out the class activation map $M_c$ of a certain class according to the multi-label classification result;
performing a max-pooling operation on the class activation map $M_c$ with a pooling kernel of size m × m, traversing every position of the class activation map with the kernel center, and recording the local maximum and the corresponding position coordinates;
when the local-maximum position coordinates recorded at a pixel of the class activation map are exactly the coordinates of that pixel, the pixel is recorded as a local peak $P_i^c$.
Using each local peak $P_i^c$ and the COB candidate regions $\{R_n\}_{n=1}^{N}$ to obtain the auxiliary mask $M_i^c$ is specifically:
for each local peak $P_i^c$, find all COB candidate regions containing the local peak, average them, and obtain the auxiliary mask $M_i^c$ corresponding to the local peak by thresholding, namely:
$\bar{R}_i^c(p,q) = \frac{1}{|\mathcal{R}_i^c|}\sum_{R_n \in \mathcal{R}_i^c} R_n(p,q)$
$M_i^c(p,q) = \mathbb{1}\big[\bar{R}_i^c(p,q) > \beta\big]$
where $\mathcal{R}_i^c$ denotes the set of COB candidate regions containing the local peak point $P_i^c$ and $|\mathcal{R}_i^c|$ their number, $p \in [0, H]$ and $q \in [0, W]$ are integers representing coordinate indices, and the threshold $\beta \in [0,1]$ is a hyper-parameter.
In an embodiment, the mapping is classification scores and categories of the COB candidate regions, and specifically includes:
in a ViT candidate region feature generator, dividing a category activation graph into n multiplied by n image blocks, and inputting a Vision Transformer network to obtain a feature vector of each image block;
splicing the feature vectors of all image blocks into a feature matrix in sequence, and then reconstructing the spliced feature matrix into a new feature matrix according to the position of each image block in the corresponding natural image;
inputting the new feature matrix into the 1 × 1 convolution, and fusing the features of each channel to obtain a feature layer F;
obtaining the features of each COB candidate area on the feature layer F by using a SegAlign method, aligning the features, and obtaining the alignment features
Figure BDA00037628407100000510
Aligning feature f n Leveling to one dimension, inputting the one dimension into three full-connection layers, and obtaining classification scores of COB candidate regions after Softmax
Figure BDA00037628407100000511
In another aspect, the invention provides a Vision Transformer network-based weakly supervised instance segmentation system, which applies the above Vision Transformer network-based weakly supervised instance segmentation method and comprises a data acquisition module, a model construction module, a model training module and an instance segmentation module;
the data acquisition module is used for acquiring a labeled natural image dataset and a natural image to be segmented;
the model construction module is used for constructing the weakly supervised instance segmentation model; the weakly supervised instance segmentation model comprises a ViT multi-label classification module and a ViT candidate region scoring module; the ViT multi-label classification module comprises a Vision Transformer network and a candidate region pseudo label generator; the ViT candidate region scoring module comprises a candidate region generator and a ViT candidate region feature generator;
the model training module is used for initializing the weakly supervised instance segmentation model, constructing a loss function, performing iterative training on the labeled natural image dataset, and optimizing the loss function to obtain a trained weakly supervised instance segmentation model;
the instance segmentation module is used for inputting the natural image to be segmented into the trained weakly supervised instance segmentation model to obtain an instance segmentation result.
In still another aspect, the invention provides a computer-readable storage medium storing a program which, when executed by a processor, implements the above Vision Transformer network-based weakly supervised instance segmentation method.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. In terms of pseudo label generation, the class activation map generated by the Vision Transformer network can cover a larger extent of an object, the local peaks generated on the class activation map can cover a larger area of the object, and the candidate region pseudo labels constructed from the relation between the local peaks and the candidate regions are more accurate, so the trained ViT candidate region scoring module can score more accurately and a more accurate instance segmentation result is obtained.
2. Thanks to the multi-head attention mechanism, the feature layer of the image generated by the Vision Transformer network attends to the global information of the image; in transfer learning, using the Vision Transformer network for feature extraction greatly reduces the number of parameters to be learned, shortens training and testing time, and saves computing resources.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart of a Vision Transformer-based weakly supervised instance segmentation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating generation of pseudo labels for candidate regions according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a candidate region feature generator according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a SegAlign process in accordance with an embodiment of the present invention;
FIG. 5 is a block diagram of a Vision Transformer-based weakly supervised instance segmentation system according to an embodiment of the present invention;
FIG. 6 is a diagram of a computer storage medium according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Referring to fig. 1, in an embodiment of the present application, a Vision Transformer-based weakly supervised instance segmentation method is provided, including the following steps:
s1, acquiring a labeled natural image data set and a natural image to be segmented;
in this embodiment, X represents a natural image containing objects of multiple classes and Y represents its multi-class label, $Y = [y_1, \ldots, y_C] \in \{0,1\}^{C\times 1}$, where $y_c = 1$ denotes that the image contains an object of class c and $y_c = 0$ that it does not; the acquired labeled natural image dataset is therefore represented as:
$\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^{N}$
where $X_i$ represents the i-th labeled natural image and $Y_i$ the label of the i-th natural image, $N$ represents the number of images in the labeled natural image dataset, and $C$ represents the number of labels.
Before iterative training of the weakly supervised instance segmentation model on the labeled natural image dataset, the natural images in the dataset are randomly cropped to a set size, randomly horizontally flipped, and then normalized per channel. In this embodiment the natural images are cropped to a size of 224 × 224.
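A minimal sketch of such a preprocessing pipeline, assuming PyTorch/torchvision; the normalization mean and standard deviation below are the common ImageNet statistics and are an assumption, not values given in this embodiment:

```python
from torchvision import transforms

# Random 224 x 224 crop, random horizontal flip, then per-channel normalization,
# as described in this embodiment.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                  # randomly crop to the set size
    transforms.RandomHorizontalFlip(p=0.5),             # random horizontal flip
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],    # assumed ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```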
S2, constructing the weakly supervised instance segmentation model, which, as shown in FIG. 1, comprises a ViT multi-label classification module and a ViT candidate region scoring module; the ViT multi-label classification module comprises a Vision Transformer network and a candidate region pseudo label generator; the ViT candidate region scoring module comprises a candidate region generator and a ViT candidate region feature generator;
the Vision Transformer network is used to obtain the multi-label classification result and generate the class activation map; the candidate region pseudo label generator generates candidate region pseudo labels according to the class activation map; the candidate region generator generates the COB candidate regions with the convolution-oriented boundary algorithm and a hierarchical segmentation algorithm; the ViT candidate region feature generator generates the feature vectors of the COB candidate regions with the SegAlign method, and the feature vectors are mapped through fully connected layers into the classification scores of the COB candidate regions;
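The overall two-branch structure can be sketched roughly as follows (assuming PyTorch; the class and attribute names are hypothetical placeholders, and the COB candidate region generator is treated as an external component):

```python
import torch.nn as nn

class WeaklySupervisedInstSeg(nn.Module):
    """Skeleton of the weakly supervised instance segmentation model described above."""
    def __init__(self, vit_classifier, cob_generator, proposal_scorer):
        super().__init__()
        self.vit_classifier = vit_classifier    # ViT multi-label classification module
        self.cob_generator = cob_generator      # COB candidate region generator
        self.proposal_scorer = proposal_scorer  # ViT candidate region scoring module

    def forward(self, image):
        logits, cam = self.vit_classifier(image)       # multi-label scores + class activation map
        proposals = self.cob_generator(image)          # COB candidate regions (binary masks)
        scores = self.proposal_scorer(cam, proposals)  # per-region classification scores
        return logits, cam, proposals, scores
```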
s3, initializing a weak supervision instance segmentation model, constructing a loss function, performing iterative training on a labeled natural image data set, and optimizing the loss function to obtain a trained weak supervision instance segmentation model;
s31, initializing the weak supervision example segmentation model refers to pre-training the weak supervision example segmentation model on the large image data set, and using model parameters after pre-training as initialization parameters. In the embodiment, the weak supervision example segmentation model is pre-trained on the ImageNet21k data set, and the parameters after pre-training are used as initialization parameters; the ImageNet21k data set is an image database organized according to a WordNet hierarchical structure, and has the advantages of more pictures, high resolution, more types and more irrelevant noises and changes, so the recognition difficulty is high, and the data set is commonly used for the evaluation of classification, positioning and detection tasks and can prevent the over-fitting phenomenon.
S32, since the weakly supervised instance segmentation model constructed by the method comprises two module branches, the two branches need to be trained separately. For the ViT multi-label classification module, in order to address the class imbalance problem, this branch is trained with the Focal Loss function, expressed as:
$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$
where $y$ is the true label, $\alpha_t$ and $\gamma$ are the balancing and focusing factors of the Focal Loss, and $p_t$ is the prediction probability, defined as:
$p_t = p$ if $y = 1$, and $p_t = 1 - p$ otherwise,
where $p$ is the output value of the Vision Transformer network without any activation function applied, obtained as follows:
a natural image of input size W × H is divided into w × h image blocks, each containing P × P pixels, where w = W/P and h = H/P; the image blocks are input into the Vision Transformer network to output a feature matrix, and the feature matrix is mapped by a convolutional layer and a global average pooling layer into a C-dimensional prediction score vector, which is p. In this embodiment, each natural image of size 224 × 224 is divided into 14 × 14 image blocks, each of 16 × 16 pixels, and a feature matrix of size 768 × 14 × 14 is obtained.
For the ViT candidate region scoring module, training uses the CE (cross-entropy) loss function, expressed as:
$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\,\log(p'_{i,k})$
where $y_{i,k}$ is the true label of the i-th COB candidate region, there are K label values and N COB candidate regions, and $p'_{i,k}$ is the probability that the i-th COB candidate region is predicted as the k-th label value; fitting the CE loss also increases the inter-class distance to a certain extent.
S33, in the process of iterative training on the labeled natural image dataset, the two module branches are trained without freezing any weights; the training process is as follows:
S331, classifying the labeled natural image dataset with the Vision Transformer network to obtain the multi-label classification result and generate the class activation map;
specifically, the Vision Transformer network comprises a convolutional layer, L cascaded Transformer blocks and a global average pooling layer; each Transformer block comprises a linear transformation layer, a multi-head self-attention layer and a multi-layer perceptron (MLP) block;
the labeled natural image dataset is input into the Vision Transformer network; each natural image in the dataset is cut into w × h image blocks, which are passed through the convolutional layer to obtain one-dimensional vectors, yielding N = w × h block tokens t; a class token is then added to the block tokens, giving $t \in \mathbb{R}^{(N+1)\times D}$, where D denotes the dimension of each block token;
all block tokens with the added class token are fed into the L cascaded Transformer blocks for feature extraction, yielding the feature matrix $S_c$ of the image and L attention vectors $\{A_l^{*}\}_{l=1}^{L}$.
Denote $t^{l-1}$ as the input of the l-th Transformer block. In the attention operation of the l-th Transformer block, the output block tokens are computed as:
$A_l = \mathrm{softmax}\!\left(\frac{(t^{l-1} W_Q)\,(t^{l-1} W_K)^{\top}}{\sqrt{D}}\right),\qquad t^{l} = A_l\,(t^{l-1} W_V)$
where the parameter matrices $W_Q$, $W_K$ and $W_V$ represent the linear transformation layers applied before the attention operation of the l-th Transformer block; the matrix $A_l$ is the attention matrix, whose first row is the attention vector of the class token, $A_l^{*}$; the attention vector $A_l^{*}$ records the dependency of the class token on the tokens of the other image blocks, and under the action of the loss function the attention vector $A_l^{*}$ is driven toward the object regions that are useful for the classification task.
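As an illustration, the class-token attention vectors can be read off the attention matrices roughly as follows (a sketch assuming PyTorch tensors; the head-averaging step is an assumption, since the text does not state how the multiple heads are combined):

```python
import torch

def class_token_attention(attn_matrices):
    """attn_matrices: (L, heads, N + 1, N + 1) attention matrices A_l of the L blocks.
    Returns (L, N) attention of the class token (row 0) over the N image-block tokens."""
    attn = attn_matrices.mean(dim=1)       # average the heads (assumption)
    return attn[:, 0, 1:]                  # first row, excluding the class token itself
```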
The feature matrix $S_c$ of the image is input into the convolutional layer and the global average pooling layer to obtain the multi-label classification result;
the L attention vectors $\{A_l^{*}\}_{l=1}^{L}$ are averaged and reshaped according to the positions of the image blocks in the natural image to obtain the attention map, with the formulas:
$A_{*} = \frac{1}{L}\sum_{l=1}^{L} A_l^{*}$
$A'_{*} = \Gamma_{w\times h}(A_{*})$
where $\Gamma_{w\times h}(\cdot)$ is the reshaping function;
the attention map is multiplied element-wise with the feature matrix of the image to produce the class activation map TS-CAM, denoted $M$; the element-wise multiplication is:
$M = A'_{*} \odot S_c$
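A minimal sketch of this class activation map computation (assuming PyTorch; tensor shapes are assumptions consistent with the 14 × 14 patch grid used in this embodiment):

```python
import torch

def ts_cam(attn_vectors, class_feature_maps, w=14, h=14):
    """attn_vectors: (L, N) class-token attention over the N = w*h image blocks;
    class_feature_maps: (C, w, h) per-class feature maps derived from S_c.
    Returns the (C, w, h) class activation map M."""
    a = attn_vectors.mean(dim=0)            # average over the L Transformer blocks
    a = a.reshape(1, w, h)                  # Gamma_{w x h}: reshape to the patch grid
    return a * class_feature_maps           # element-wise product -> TS-CAM
```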
s332, inputting the labeled natural image data set into a candidate region generator, and generating a COB candidate region by using a convolution-oriented boundary algorithm and a hierarchical segmentation algorithm;
a Convolutional guided boundary (COB) is based on MCG, a deep Convolutional neural network is used to obtain edge information of an image, the algorithm realizes an end-to-end learnable Convolutional neural network to detect the edge and the corresponding direction of the image, and only one-time forward propagation of the image level is needed to generate multi-scale contour information and estimate the direction of the edge; the COB candidate regions are generated by using the MCG hierarchical segmentation algorithm using these multi-scale contour information and edge directions.
S333, obtaining the candidate region pseudo labels with the candidate region pseudo label generator according to the class activation map and the COB candidate regions;
as shown in FIG. 2, in the candidate region pseudo label generator the local peaks $P_i^c$ on the class activation map are obtained as follows:
first, the class activation map $M_c$ of a certain class is taken out according to the multi-label classification result; a max-pooling operation is then performed on the class activation map $M_c$ with a pooling kernel of size m × m; the kernel center traverses every position of the class activation map and the local maximum and its position coordinates are recorded; when the local-maximum position coordinates recorded at a pixel of the class activation map are exactly the coordinates of that pixel, the pixel is recorded as a local peak $P_i^c$ of the class activation map.
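A sketch of this peak extraction using a single max-pooling pass (assuming PyTorch; the kernel size m and the optional minimum activation are assumptions):

```python
import torch
import torch.nn.functional as F

def local_peaks(cam, m=3, min_val=0.0):
    """cam: (H, W) class activation map M_c of one class; returns peak coordinates (y, x)."""
    pooled = F.max_pool2d(cam[None, None], kernel_size=m, stride=1, padding=m // 2)[0, 0]
    is_peak = (cam == pooled) & (cam > min_val)   # a pixel equal to its local maximum is a peak
    ys, xs = torch.nonzero(is_peak, as_tuple=True)
    return list(zip(ys.tolist(), xs.tolist()))
```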
Using each local peak $P_i^c$ and the COB candidate regions $\{R_n\}_{n=1}^{N}$, the auxiliary mask $M_i^c$ is obtained as follows:
for each local peak $P_i^c$, all COB candidate regions containing the local peak are found and averaged, and the auxiliary mask $M_i^c$ corresponding to the local peak is obtained by thresholding, namely:
$\bar{R}_i^c(p,q) = \frac{1}{|\mathcal{R}_i^c|}\sum_{R_n \in \mathcal{R}_i^c} R_n(p,q)$
$M_i^c(p,q) = \mathbb{1}\big[\bar{R}_i^c(p,q) > \beta\big]$
where $\mathcal{R}_i^c$ denotes the set of COB candidate regions containing the local peak point $P_i^c$ and $|\mathcal{R}_i^c|$ their number, $p \in [0, H]$ and $q \in [0, W]$ are integers representing coordinate indices, H and W represent the height and width of the natural image respectively, and the threshold $\beta \in [0,1]$ is a hyper-parameter.
The auxiliary masks $M_i^c$ are sorted in ascending order of their local peak values $P_i^c$, and the overlap $IOU(R_n, M_i^c)$ between a COB candidate region $R_n$ and each auxiliary mask $M_i^c$ is computed in turn; if the overlap exceeds a threshold $\lambda$, the pseudo label $z_n$ of that COB candidate region is set to class c, i.e. $z_n = c$; the overlap IOU denotes the ratio of the intersection of the two regions to their union; in this embodiment the threshold is $\lambda = 0.5$.
If the overlap between a COB candidate region and all auxiliary masks is below the threshold, the COB candidate region is labeled as the background class.
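A sketch of the auxiliary-mask construction and pseudo-label assignment just described (assuming PyTorch and binary mask tensors; the tensor shapes, ordering convention and default thresholds are assumptions where not stated above):

```python
import torch

def auxiliary_mask(peak, proposals, beta=0.5):
    """peak: (y, x) local peak; proposals: (N, H, W) binary COB candidate masks."""
    containing = proposals[proposals[:, peak[0], peak[1]] > 0]   # regions containing the peak
    if len(containing) == 0:
        return None
    return (containing.float().mean(dim=0) > beta).float()       # average, then threshold at beta

def mask_iou(a, b):
    inter = (a * b).sum()
    union = a.sum() + b.sum() - inter
    return (inter / union.clamp_min(1e-8)).item()

def assign_pseudo_labels(proposals, aux_masks, background=0, lam=0.5):
    """aux_masks: list of (mask, class_id) pairs ordered by their local peak values.
    Each COB candidate region takes the class of the first auxiliary mask it overlaps
    with IoU above lam; otherwise it is labelled as background."""
    labels = torch.full((len(proposals),), background, dtype=torch.long)
    for n in range(len(proposals)):
        for mask, cls in aux_masks:
            if mask_iou(proposals[n].float(), mask) > lam:
                labels[n] = cls
                break
    return labels
```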
S334, inputting the COB candidate regions into the ViT candidate region feature generator, generating the feature vectors of the COB candidate regions with the SegAlign method, and mapping the feature vectors through fully connected layers to the classification scores and classes of the COB candidate regions;
specifically, as shown in FIG. 3, in the ViT candidate region feature generator the class activation map is divided into n × n image blocks, which are input into the Vision Transformer network to obtain the feature vector (token embedding) of each image block; in this embodiment the class activation map is divided into 14 × 14 image blocks and the obtained feature vectors are 768-dimensional.
The token embeddings of all image blocks are concatenated in order into a 196 × 768 feature matrix, and the concatenated feature matrix, i.e. the matrix containing the n × n (14 × 14) feature vectors, is then reshaped into a new feature matrix according to the position of each image block in the corresponding natural image;
the new feature matrix is input into a 1 × 1 convolution, which fuses the features of the channels to obtain the feature layer F (feature maps); the role of the 1 × 1 convolution is to fuse the features of the channels, changing only the number of channels of the feature map without changing its width and height;
the features of each COB candidate region are obtained and aligned on the feature layer F with the SegAlign method, giving the aligned feature $f_n$; the aligned feature $f_n$ is flattened to one dimension and input into three fully connected layers, and the classification score $s_n$ of each COB candidate region is obtained after Softmax; in this embodiment the fully connected layers have 4096, 4096 and C (the number of labels) nodes.
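A sketch of this scoring head (assuming PyTorch; the class name is hypothetical, `seg_align` stands for the region-alignment operator sketched after the SegAlign description below, and the 7 × 7 aligned feature size is an assumption):

```python
import torch
import torch.nn as nn

class ProposalScoringHead(nn.Module):
    """Rebuilds the spatial token grid, fuses channels with a 1 x 1 conv, pools each COB
    candidate region with SegAlign, and maps it to classification scores."""
    def __init__(self, num_label_values, embed_dim=768, align_size=7):
        super().__init__()
        self.fuse = nn.Conv2d(embed_dim, embed_dim, kernel_size=1)   # 1 x 1 channel fusion
        self.fc = nn.Sequential(
            nn.Linear(embed_dim * align_size * align_size, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_label_values),
        )
        self.align_size = align_size

    def forward(self, patch_tokens, proposals, seg_align):
        """patch_tokens: (196, 768) token embeddings of one image; proposals: (N, H, W) masks."""
        n_tok, d = patch_tokens.shape
        grid = int(n_tok ** 0.5)                                      # 14 for 196 tokens
        feat = patch_tokens.t().reshape(1, d, grid, grid)             # rebuild the spatial layout
        feat = self.fuse(feat)[0]                                     # feature layer F: (d, 14, 14)
        aligned = torch.stack([seg_align(feat, r, self.align_size) for r in proposals])
        return torch.softmax(self.fc(aligned.flatten(1)), dim=1)      # (N, num_label_values)
```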
SegAlign is an improved version of RoIAlign that can be applied to COB candidate regions. As shown in FIG. 4, SegAlign outputs the aligned feature $f_n$ corresponding to a candidate region from the input feature layer F and the candidate region R corresponding to the image, specifically:
for a candidate region R, its receptive field on the feature map F is $R_F$, and the circumscribed rectangle of the receptive field $R_F$ corresponding to the candidate region R is B; if the function $\varphi$ is a bilinear transformation from the spatial coordinates $(i,j) \in f_n$ to $(i',j') \in B$, and $F(\cdot,\cdot)$ is the bilinear interpolation function on the feature map F, the SegAlign operator can be expressed as:
$f_n(i,j) = F\big(\varphi(i,j)\big)\cdot R_F\big(\varphi(i,j)\big)$
(for simplicity of presentation, the channel dimension of the feature layer is not written out in this formula).
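A minimal sketch of a SegAlign-style operator consistent with this description (assuming PyTorch); this is an interpretation, not necessarily the authors' exact implementation: the feature layer is sampled inside the circumscribed rectangle of the region's receptive field and masked by the region so that only features covered by the segment are kept.

```python
import torch
import torch.nn.functional as F

def seg_align(feature, region_mask, out_size=7):
    """feature: (D, Hf, Wf) feature layer F; region_mask: (H, W) binary mask of one COB
    candidate region in image coordinates; returns a (D, out_size, out_size) aligned feature."""
    # Receptive field R_F of the region on F: resample the mask to the feature resolution.
    mask_f = F.interpolate(region_mask[None, None].float(), size=feature.shape[1:],
                           mode="nearest")[0, 0]
    ys, xs = torch.nonzero(mask_f, as_tuple=True)
    if len(ys) == 0:
        return feature.new_zeros(feature.shape[0], out_size, out_size)
    y0, y1 = int(ys.min()), int(ys.max()) + 1            # circumscribed rectangle B
    x0, x1 = int(xs.min()), int(xs.max()) + 1
    masked = feature[:, y0:y1, x0:x1] * mask_f[y0:y1, x0:x1]
    # Bilinear resampling of the masked crop onto the out_size x out_size output grid.
    return F.interpolate(masked[None], size=(out_size, out_size),
                         mode="bilinear", align_corners=False)[0]
```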
S335, calculating the loss value, optimizing the loss function, and training iteratively until convergence to obtain the trained weakly supervised instance segmentation model.
S4, inputting the natural image to be segmented into the trained weakly supervised instance segmentation model to obtain the instance segmentation result.
The natural image is input into the COB candidate region generator to generate the candidate regions $\{R_n\}_{n=1}^{N}$, which are then input into the ViT candidate region scoring module to obtain the class and score of each candidate region; the final instance segmentation result is obtained through non-maximum suppression (NMS).
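A sketch of this inference procedure (assuming PyTorch and mask-level proposals; `cob_proposals` and `score_proposals` stand in for the COB candidate region generator and the trained ViT candidate region scoring module, class index 0 is assumed to be background, and the thresholds are assumptions):

```python
import torch

def mask_iou(a, b):
    inter = (a & b).sum().float()
    union = (a | b).sum().float()
    return (inter / union.clamp_min(1)).item()

def segment_instances(image, cob_proposals, score_proposals, score_thresh=0.5, iou_thresh=0.5):
    masks = cob_proposals(image)                   # (N, H, W) boolean COB candidate regions
    probs = score_proposals(image, masks)          # (N, K) class scores, index 0 = background
    scores, labels = probs[:, 1:].max(dim=1)       # best foreground class per candidate region
    keep = scores > score_thresh
    masks, scores, labels = masks[keep], scores[keep], labels[keep] + 1

    order = scores.argsort(descending=True)        # greedy mask-level non-maximum suppression
    selected = []
    for i in order.tolist():
        if all(mask_iou(masks[i], masks[j]) < iou_thresh for j in selected):
            selected.append(i)
    return [(masks[i], int(labels[i]), float(scores[i])) for i in selected]
```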
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention.
Based on the same idea as the Vision Transformer-based weakly supervised instance segmentation method in the embodiment, the invention further provides a Vision Transformer-based weakly supervised instance segmentation system, which can be used for executing the Vision Transformer-based weakly supervised instance segmentation method. For convenience of explanation, in the structural schematic diagram of the embodiment of the Vision Transformer-based weak supervision example segmentation system, only the part related to the embodiment of the present invention is shown, and those skilled in the art will understand that the illustrated structure does not constitute a limitation to the apparatus, and may include more or less components than those illustrated, or combine some components, or arrange different components.
Referring to fig. 5, in another embodiment of the present application, a Vision Transformer-based weakly supervised instance partitioning system is provided, which includes a data acquisition module, a model construction module, a model training module, and an instance partitioning module;
the data acquisition module is used for acquiring a labeled natural image dataset and a natural image to be segmented;
the model construction module is used for constructing the weakly supervised instance segmentation model, which comprises a ViT multi-label classification module and a ViT candidate region scoring module; the ViT multi-label classification module comprises a Vision Transformer network and a candidate region pseudo label generator; the ViT candidate region scoring module comprises a candidate region generator and a ViT candidate region feature generator;
the model training module is used for initializing the weakly supervised instance segmentation model, constructing a loss function, performing iterative training on the labeled natural image dataset, and optimizing the loss function to obtain a trained weakly supervised instance segmentation model;
the instance segmentation module is used for inputting the natural image to be segmented into the trained weakly supervised instance segmentation model to obtain an instance segmentation result.
It should be noted that the Vision Transformer-based weakly supervised instance segmentation system of the present invention corresponds one-to-one with the Vision Transformer-based weakly supervised instance segmentation method of the present invention; the technical features and beneficial effects described in the embodiment of the method are all applicable to the embodiment of the system, and the specific contents may refer to the description in the method embodiment and are not repeated here.
In addition, in the implementation of the Vision Transformer-based weakly supervised instance partitioning system according to the above embodiment, the logical partitioning of each program module is only an example, and in practical applications, the above function allocation may be performed by different program modules according to needs, for example, due to the configuration requirements of corresponding hardware or the convenience of software implementation, that is, the internal structure of the Vision Transformer-based weakly supervised instance partitioning system is partitioned into different program modules, so as to perform all or part of the above described functions.
Referring to fig. 6, in an embodiment, a computer-readable storage medium is provided, which stores a program in a memory, and when the program is executed by a processor, the method for partitioning a weakly supervised instance based on a Vision Transformer may be implemented, where the method includes:
acquiring a labeled natural image dataset and a natural image to be segmented;
constructing a weakly supervised instance segmentation model, which comprises a ViT multi-label classification module and a ViT candidate region scoring module; the ViT multi-label classification module comprises a Vision Transformer network and a candidate region pseudo label generator; the ViT candidate region scoring module comprises a candidate region generator and a ViT candidate region feature generator;
the Vision Transformer network is used to obtain the multi-label classification result and generate the class activation map; the candidate region pseudo label generator generates candidate region pseudo labels according to the class activation map; the candidate region generator generates the COB candidate regions with the convolution-oriented boundary algorithm and a hierarchical segmentation algorithm; the ViT candidate region feature generator generates the feature vectors of the COB candidate regions with the SegAlign method, and the feature vectors are mapped through fully connected layers into the classification scores of the COB candidate regions;
initializing the weakly supervised instance segmentation model, constructing a loss function, performing iterative training on the labeled natural image dataset, and optimizing the loss function to obtain a trained weakly supervised instance segmentation model;
and inputting the natural image to be segmented into the trained weakly supervised instance segmentation model to obtain an instance segmentation result.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program, which may be stored in a non-volatile computer readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), rambus (Rambus) direct RAM (RDRAM), direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such modifications are intended to be included in the scope of the present invention.

Claims (10)

1. A Vision Transformer network-based weakly supervised instance segmentation method, characterized by comprising the following steps:
acquiring a labeled natural image dataset and a natural image to be segmented;
constructing a weakly supervised instance segmentation model; the weakly supervised instance segmentation model comprises a ViT multi-label classification module and a ViT candidate region scoring module; the ViT multi-label classification module comprises a Vision Transformer network and a candidate region pseudo label generator; the ViT candidate region scoring module comprises a candidate region generator and a ViT candidate region feature generator;
the Vision Transformer network is used to obtain a multi-label classification result and generate a class activation map; the candidate region pseudo label generator generates candidate region pseudo labels according to the class activation map; the candidate region generator generates COB candidate regions with a convolution-oriented boundary algorithm and a hierarchical segmentation algorithm; the ViT candidate region feature generator generates the feature vectors of the COB candidate regions with the SegAlign method, and the feature vectors are mapped through fully connected layers into the classification scores of the COB candidate regions;
initializing the weakly supervised instance segmentation model, constructing a loss function, performing iterative training on the labeled natural image dataset, and optimizing the loss function to obtain a trained weakly supervised instance segmentation model;
and inputting the natural image to be segmented into the trained weakly supervised instance segmentation model to obtain an instance segmentation result.
2. The Vision Transformer network-based weakly supervised instance segmentation method of claim 1, wherein the labeled natural image dataset is represented as:
$\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^{N}$
where $X_i$ represents the i-th labeled natural image and $Y_i$ the label of the i-th natural image, $N$ represents the number of images in the labeled natural image dataset, and $C$ represents the number of labels;
before iterative training of the weakly supervised instance segmentation model on the labeled natural image dataset, the natural images in the dataset are randomly cropped to a set size, randomly horizontally flipped, and normalized per channel;
initializing the weakly supervised instance segmentation model means pre-training it on a large image dataset and using the pre-trained model parameters as the initialization parameters.
3. The Vision Transformer network-based weakly supervised instance segmentation method of claim 2, wherein the loss functions comprise a Focal Loss function and a cross-entropy (CE) loss function;
the Focal Loss function is used to train the ViT multi-label classification module and is expressed as:
$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$
where $y$ is the true label, $\alpha_t$ and $\gamma$ are the balancing and focusing factors of the Focal Loss, and $p_t$ is the prediction probability, defined as:
$p_t = p$ if $y = 1$, and $p_t = 1 - p$ otherwise,
where $p$ is the output value of the Vision Transformer network without any activation function applied, obtained as follows:
a natural image of input size W × H is divided into w × h image blocks, each containing P × P pixels, where w = W/P and h = H/P; the image blocks are input into the Vision Transformer network to output a feature matrix, and the feature matrix is mapped by a convolutional layer and a global average pooling layer into a C-dimensional prediction score vector, which is the output value p of the Vision Transformer network;
the CE loss function is used to train the ViT candidate region scoring module and is expressed as:
$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\,\log(p'_{i,k})$
where $y_{i,k}$ is the true label of the i-th COB candidate region, there are K label values and N COB candidate regions, and $p'_{i,k}$ is the probability that the i-th COB candidate region is predicted as the k-th label value.
4. The Vision Transformer network-based weakly supervised instance segmentation method of claim 3, wherein the iterative training on the labeled natural image dataset specifically comprises:
classifying the labeled natural image dataset with the Vision Transformer network to obtain a multi-label classification result and generate a class activation map;
inputting the labeled natural image dataset into the candidate region generator and generating COB candidate regions with the convolution-oriented boundary algorithm and a hierarchical segmentation algorithm;
obtaining candidate region pseudo labels with the candidate region pseudo label generator according to the class activation map and the COB candidate regions;
inputting the COB candidate regions into the ViT candidate region feature generator, generating the feature vectors of the COB candidate regions with the SegAlign method, and mapping the feature vectors through fully connected layers to the classification scores and classes of the COB candidate regions;
calculating the loss value, optimizing the loss function, and training iteratively until convergence to obtain the trained weakly supervised instance segmentation model.
5. The Vision Transformer network based weakly supervised instance partitioning method of claim 4, wherein the Vision Transformer network comprises convolution layers, L cascaded Transformer blocks and a global average pooling layer; the transformer blocks comprise a linear transformation layer, a multi-head self-attention layer and a multi-layer sensing block;
the obtaining of the multi-label classification result and the generation of the class activation map specifically include:
inputting the labeled natural image data set into the Vision Transformer network, cutting each natural image of size W × H in the labeled natural image data set into w × h image blocks, performing a convolution operation through a convolution layer to obtain one-dimensional vectors, thereby obtaining N patch tokens t; a class token is then appended to the patch tokens, giving a token sequence of dimension (N + 1) × D, wherein D denotes the dimension of each token;
sending all patch tokens together with the appended class token into the L cascaded Transformer blocks for feature extraction, to obtain a feature matrix S_c of the image and L attention vectors A_1, …, A_L;
inputting the feature matrix S_c of the image into the convolution layer and the global average pooling layer to obtain the multi-label classification result;
averaging the L attention vectors A_1, …, A_L and reshaping the result according to the positions of the image blocks in the natural image to obtain the attention map, with the formulas:

A_* = (1/L) · Σ_l A_l

A′_* = Γ_{w×h}(A_*)

wherein Γ_{w×h}(·) is a reshaping function;
the attention map and the feature matrix of the image are multiplied element by element to produce the class activation map TS-CAM, denoted M_c; the element-wise multiplication formula is:

M_c = A′_* ⊙ S_c

wherein ⊙ denotes element-wise (Hadamard) multiplication.
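As an illustration of the attention averaging, reshaping and element-wise coupling described above, the following sketch assumes the attention vectors are stacked as an (L, N) tensor and the feature matrix is laid out as (C, w, h); these layouts are assumptions made for the sketch.

```python
# TS-CAM-style coupling sketch; tensor layouts are assumptions made for illustration.
import torch

def ts_cam(attn_vectors: torch.Tensor, feature_map: torch.Tensor, w: int, h: int) -> torch.Tensor:
    # attn_vectors: (L, N) class-token attention from the L Transformer blocks (N = w * h).
    # feature_map:  (C, w, h) feature matrix S_c after the convolution layer.
    attn_avg = attn_vectors.mean(dim=0)          # A_*  : average over the L layers
    attn_map = attn_avg.reshape(w, h)            # A'_* : reshape to the patch grid
    return feature_map * attn_map.unsqueeze(0)   # M_c = A'_* (element-wise) S_c, broadcast over channels
```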
6. The Vision Transformer network based weak supervision instance segmentation method according to claim 5, wherein obtaining the candidate region pseudo labels by using the candidate region pseudo label generator is specifically:
obtaining local peak values on the class activation map in the candidate region pseudo label generator;
using each local peak value and the COB candidate regions to obtain an auxiliary mask;
arranging the auxiliary masks in ascending order according to their local peak values, and sequentially calculating the degree of overlap between a given COB candidate region R_n and each auxiliary mask;
if the degree of overlap exceeds a threshold λ, the pseudo label z_n of the COB candidate region is set to the corresponding class c, i.e. z_n = c; the degree of overlap IOU denotes the ratio of the intersection of the two regions to their union;
if the degree of overlap between a certain COB candidate region and all auxiliary masks is lower than the threshold, the COB candidate region is marked as a background class.
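A sketch of the IoU-based pseudo-label assignment described above follows; the binary-mask representation, the ordering of the auxiliary masks and the threshold value λ are assumptions for illustration.

```python
# IoU-based pseudo-label assignment sketch; mask layout and threshold value are assumptions.
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum()) / union if union > 0 else 0.0

def assign_pseudo_labels(regions: np.ndarray, aux_masks, lam: float = 0.5, background: int = 0):
    # regions: (N, H, W) boolean COB candidate masks; aux_masks: list of (class_id, (H, W) bool mask),
    # assumed already sorted as described in the claim.
    labels = np.full(len(regions), background, dtype=np.int64)   # default: background class
    for n, region in enumerate(regions):
        for class_id, mask in aux_masks:
            if iou(region, mask) > lam:                          # overlap exceeds threshold lambda
                labels[n] = class_id                             # z_n = c
                break
    return labels
```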
7. The Vision Transformer network based weakly supervised instance segmentation method of claim 6, wherein obtaining the local peak values on the class activation map specifically comprises the following steps:
taking out the class activation map M_c of a certain class according to the multi-label classification result;
performing a max pooling operation on the class activation map M_c with a pooling kernel of size m × m; the center of the pooling kernel traverses each position of the class activation map, and the local maximum value and its corresponding position coordinates are recorded;
when the local maximum position coordinates recorded at a certain pixel of the class activation map are exactly the position coordinates of that pixel, the pixel is recorded as a local peak value;
the using of each local peak value and the COB candidate regions to obtain an auxiliary mask operates as follows:
for each local peak value, finding all COB candidate regions containing the local peak value, averaging them, and obtaining the auxiliary mask corresponding to the local peak value through threshold selection, namely:

AvgMask_i(p, q) = (1/N_i) · Σ_{n: R_n contains the i-th local peak} R_n(p, q)

AuxMask_i(p, q) = 1 if AvgMask_i(p, q) > β, and AuxMask_i(p, q) = 0 otherwise,

wherein N_i denotes the number of COB candidate regions containing the i-th local peak point, p ∈ [0, H] and q ∈ [0, W] are integers denoting coordinate indices, and the threshold β ∈ [0, 1] is a hyper-parameter.
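The local-peak extraction and auxiliary-mask construction above can be sketched as follows; the kernel size m, the threshold β and the tensor layouts are assumptions made for the sketch.

```python
# Local-peak extraction and auxiliary-mask construction sketch; m, beta and layouts are assumptions.
import torch
import torch.nn.functional as F

def local_peaks(cam: torch.Tensor, m: int = 3):
    # cam: (H, W) class activation map M_c; a pixel is a local peak if it equals the m x m local maximum.
    pooled = F.max_pool2d(cam[None, None], kernel_size=m, stride=1, padding=m // 2)[0, 0]
    ys, xs = torch.nonzero(cam == pooled, as_tuple=True)
    return list(zip(ys.tolist(), xs.tolist()))

def auxiliary_mask(peak, regions: torch.Tensor, beta: float = 0.5) -> torch.Tensor:
    # regions: (N, H, W) binary COB candidate masks; average those containing the peak, then threshold.
    y, x = peak
    containing = regions[regions[:, y, x] > 0]
    if containing.numel() == 0:
        return torch.zeros_like(regions[0], dtype=torch.float32)
    return (containing.float().mean(dim=0) > beta).float()
```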
8. The Vision Transformer network based weakly supervised instance segmentation method of claim 6, wherein the mapping into classification scores and categories of the COB candidate regions is specifically:
in the ViT candidate region feature generator, dividing the class activation map into n × n image blocks and inputting them into the Vision Transformer network to obtain a feature vector for each image block;
concatenating the feature vectors of all image blocks into a feature matrix in sequence, and then rearranging the concatenated feature matrix into a new feature matrix according to the position of each image block in the corresponding natural image;
inputting the new feature matrix into a 1 × 1 convolution, and fusing the features of each channel to obtain a feature layer F;
obtaining the features of each COB candidate region on the feature layer F by using a SegAlign method and aligning them to obtain the aligned feature f_n;
flattening the aligned feature f_n to one dimension, inputting it into three fully connected layers, and obtaining the classification score of the COB candidate region after Softmax.
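Finally, a sketch of the candidate-region scoring head; since the SegAlign details are not reproduced here, the sketch approximates the alignment step with masked average pooling over the feature layer F, and the layer widths and number of classes are assumptions rather than values from the claims.

```python
# Region scoring head sketch; masked average pooling stands in for SegAlign, and all sizes are assumed.
import torch
import torch.nn as nn

class RegionScorer(nn.Module):
    def __init__(self, channels: int = 256, num_classes: int = 21):
        super().__init__()
        self.fcs = nn.Sequential(                      # three fully connected layers
            nn.Linear(channels, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, feature_layer: torch.Tensor, region_masks: torch.Tensor) -> torch.Tensor:
        # feature_layer: (C, H, W) feature layer F; region_masks: (N, H, W) binary COB candidate masks.
        masks = region_masks.float()
        area = masks.sum(dim=(1, 2)).clamp(min=1.0)
        feats = torch.einsum('chw,nhw->nc', feature_layer, masks) / area[:, None]  # per-region features
        return torch.softmax(self.fcs(feats), dim=1)   # classification scores after Softmax
```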
9. The Vision Transformer network-based weak supervision instance segmentation system is characterized by being applied to the Vision Transformer network-based weak supervision instance segmentation method of any one of claims 1-8, and comprising a data acquisition module, a model construction module, a model training module and an instance segmentation module;
the data acquisition module is used for acquiring a labeled natural image data set and a natural image to be segmented;
the model construction module is used for constructing the weak supervision instance segmentation model; the weak supervision instance segmentation model comprises a ViT multi-label classification module and a ViT candidate region scoring module; the ViT multi-label classification module comprises a Vision Transformer network and a candidate region pseudo label generator; the ViT candidate region scoring module comprises a candidate region generator and a ViT candidate region feature generator;
the model training module is used for initializing a weak supervision instance segmentation model, constructing a loss function, performing iterative training on a labeled natural image data set, and optimizing the loss function to obtain a trained weak supervision instance segmentation model;
the example segmentation module is used for inputting the natural image to be segmented into the trained weak supervision example segmentation model to obtain an example segmentation result.
10. A computer-readable storage medium storing a program, wherein the program, when executed by a processor, implements the Vision Transformer network-based weakly supervised instance segmentation method of any of claims 1 to 8.
CN202210877230.0A 2022-07-25 2022-07-25 Vision transform network-based weak supervision instance segmentation method, system and medium Pending CN115359254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210877230.0A CN115359254A (en) 2022-07-25 2022-07-25 Vision transform network-based weak supervision instance segmentation method, system and medium

Publications (1)

Publication Number Publication Date
CN115359254A true CN115359254A (en) 2022-11-18

Family

ID=84031992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210877230.0A Pending CN115359254A (en) 2022-07-25 2022-07-25 Vision transform network-based weak supervision instance segmentation method, system and medium

Country Status (1)

Country Link
CN (1) CN115359254A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116403015A (en) * 2023-03-13 2023-07-07 武汉大学 Unsupervised target re-identification method and system based on perception-aided learning transducer model
CN116403015B (en) * 2023-03-13 2024-05-03 武汉大学 Unsupervised target re-identification method and system based on perception-aided learning transducer model
CN116385455B (en) * 2023-05-22 2024-01-26 北京科技大学 Flotation foam image example segmentation method and device based on gradient field label
CN116385455A (en) * 2023-05-22 2023-07-04 北京科技大学 Flotation foam image example segmentation method and device based on gradient field label
CN116342627B (en) * 2023-05-23 2023-09-08 山东大学 Intestinal epithelial metaplasia area image segmentation system based on multi-instance learning
CN116342627A (en) * 2023-05-23 2023-06-27 山东大学 Intestinal epithelial metaplasia area image segmentation system based on multi-instance learning
CN116363372A (en) * 2023-06-01 2023-06-30 之江实验室 Weak supervision semantic segmentation method, device, equipment and storage medium
CN116363372B (en) * 2023-06-01 2023-08-15 之江实验室 Weak supervision semantic segmentation method, device, equipment and storage medium
CN117333485A (en) * 2023-11-30 2024-01-02 华南理工大学 WSI survival prediction method based on weak supervision depth ordinal regression network
CN117333485B (en) * 2023-11-30 2024-04-05 华南理工大学 WSI survival prediction method based on weak supervision depth ordinal regression network
CN117372701B (en) * 2023-12-07 2024-03-12 厦门瑞为信息技术有限公司 Interactive image segmentation method based on Transformer
CN117372701A (en) * 2023-12-07 2024-01-09 厦门瑞为信息技术有限公司 Interactive image segmentation method based on Transformer
CN118096799A (en) * 2024-04-29 2024-05-28 浙江大学 Hybrid weakly-supervised wafer SEM defect segmentation method and system
CN118135425A (en) * 2024-05-07 2024-06-04 江西啄木蜂科技有限公司 Method for detecting change of concerned area in natural protected area

Similar Documents

Publication Publication Date Title
CN115359254A (en) Vision transform network-based weak supervision instance segmentation method, system and medium
Qian et al. 3D object detection for autonomous driving: A survey
CN110781262B (en) Semantic map construction method based on visual SLAM
CN112395957B (en) Online learning method for video target detection
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN112686274B (en) Target object detection method and device
CN112488083A (en) Traffic signal lamp identification method, device and medium for extracting key points based on heatmap
CN114332893A (en) Table structure identification method and device, computer equipment and storage medium
CN112446431A (en) Feature point extraction and matching method, network, device and computer storage medium
Zhang et al. Appearance-based loop closure detection via locality-driven accurate motion field learning
CN109190662A (en) A kind of three-dimensional vehicle detection method, system, terminal and storage medium returned based on key point
Yu et al. LiDAR-based localization using universal encoding and memory-aware regression
Lv et al. Memory‐augmented neural networks based dynamic complex image segmentation in digital twins for self‐driving vehicle
CN112418207B (en) Weak supervision character detection method based on self-attention distillation
US20210224646A1 (en) Method for generating labeled data, in particular for training a neural network, by improving initial labels
US20230298335A1 (en) Computer-implemented method, data processing apparatus and computer program for object detection
Shi et al. Dense semantic 3D map based long-term visual localization with hybrid features
CN116189130A (en) Lane line segmentation method and device based on image annotation model
CN114913519B (en) 3D target detection method and device, electronic equipment and storage medium
CN116071719A (en) Lane line semantic segmentation method and device based on model dynamic correction
Plachetka et al. DNN-based map deviation detection in LiDAR point clouds
CN114596588A (en) Damaged pedestrian image re-identification method and device based on text auxiliary feature alignment model
CN111338336B (en) Automatic driving method and device
Guo et al. Udtiri: An open-source road pothole detection benchmark suite
Saranya et al. Semantic annotation of land cover remote sensing images using fuzzy CNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination