CN115359254A - Vision Transformer network-based weakly supervised instance segmentation method, system and medium - Google Patents

Vision Transformer network-based weakly supervised instance segmentation method, system and medium

Info

Publication number
CN115359254A
Authority
CN
China
Prior art keywords
candidate region
cob
natural image
vit
weak supervision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210877230.0A
Other languages
Chinese (zh)
Inventor
余晋刚 (Yu Jingang)
梁宇琦 (Liang Yuqi)
吴仕科 (Wu Shike)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210877230.0A
Publication of CN115359254A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a weakly supervised instance segmentation method, system and medium based on a Vision Transformer network. The method comprises the following steps: acquiring a labeled natural image dataset and a natural image to be segmented; constructing a weakly supervised instance segmentation model, the model comprising a ViT multi-label classification module and a ViT candidate region scoring module, where the ViT multi-label classification module comprises a Vision Transformer network and a candidate region pseudo label generator, and the ViT candidate region scoring module comprises a candidate region generator and a ViT candidate region feature generator; initializing the weakly supervised instance segmentation model, constructing a loss function, performing iterative training on the labeled natural image dataset, and optimizing the loss function to obtain a trained weakly supervised instance segmentation model; and inputting the natural image to be segmented into the trained weakly supervised instance segmentation model to obtain an instance segmentation result. The invention realizes instance segmentation of natural images while keeping high performance, accelerating inference and reducing the consumption of computing power.

Description

Vision Transformer network-based weakly supervised instance segmentation method, system and medium
Technical Field
The invention belongs to the technical field of weakly supervised instance segmentation, and in particular relates to a weakly supervised instance segmentation method, system and medium based on a Vision Transformer network.
Background
Instance segmentation is one of the key problems in image understanding and computer vision: the task is to predict the class of every pixel in an image and to assign different instance labels to individual instances belonging to the same class. In medical image analysis, instance segmentation enables extremely accurate understanding of the data and greatly improves diagnostic efficiency and accuracy; in robotics and autonomous driving, instance segmentation provides pixel-level scene understanding and improves recognition efficiency and precision. Since high-performance instance segmentation requires fine pixel-level annotation, researchers are increasingly interested in how to train models with only image-level class labels so as to approach the instance segmentation performance of the fully supervised setting and save time and labeling cost. Weakly supervised instance segmentation is a new research direction whose core is to locate each instance and find explicit instance boundaries, which raises a series of challenges: first, with only image-level class labels, a trained multi-label classification network classifies according to the most discriminative regional features of the objects of each class in the dataset, so the class activation maps or saliency maps obtained from a convolutional neural network (CNN) usually focus only on an incomplete subset of instances and on parts of objects of the same class, and the localization information for different instances is defective; second, finding explicit instance boundaries is not easy, and a CNN cannot automatically delineate the boundaries between instances without pixel-level instance boundary annotations.
To address the above technical problems, three kinds of solutions exist in the prior art: the first generates candidate masks of the image with an advanced method, where the redundant candidate masks are likely to contain all instances in the picture and have relatively accurate boundaries; the second establishes a supervision signal for instance boundaries from the class activation map or saliency map generated by a CNN and from the change of pixel gray values, and trains an instance filling module; the third designs a new back-propagation scheme on top of the CNN architecture and obtains the contour information of each instance in the original image by back-propagating the instance's peak response point; examples include the PRM method proposed by Zhou et al. in "Weakly Supervised Instance Segmentation using Class Peak Response" and the WS-RCNN framework proposed by Ou et al. in "Learning to Score Proposals for Weakly Supervised Instance Segmentation". However, in the prior art, on the one hand, because of defects in the candidate mask scoring mechanism and because the class activation map generated by a CNN focuses only on the most salient region, a large number of non-salient instances are often lost, so the instance segmentation result attends only to the most discriminative parts and the effect is poor; on the other hand, to guarantee higher performance, large and deep state-of-the-art CNNs are used as the backbone of the instance segmentation model, and model training consumes a large amount of computing resources, so training and inference take a long time and efficiency is low; in particular, the prior art only addresses the instance segmentation task in the computer vision (CV) modality and cannot be directly fused with tasks in the natural language processing (NLP) modality to achieve higher-level tasks.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and to provide a Vision Transformer-based weakly supervised instance segmentation method, system and medium, in which a class activation map is generated by learning the global information of the image with a Vision Transformer network, and pseudo labels are constructed from the COB candidate regions generated by the convolution-oriented boundary algorithm combined with a hierarchical segmentation algorithm; finally, the classification scores of the COB candidate regions are predicted from the class activation map by the ViT candidate region feature generator to obtain the instance segmentation result; the method has few training parameters, short training time, and accurate and effective results.
In order to achieve the purpose, the invention adopts the following technical scheme:
in one aspect, the invention provides a Vision Transformer network-based weak supervision instance segmentation method, which comprises the following steps:
acquiring a natural image data set with a label and a natural image to be segmented;
constructing a weak supervision example segmentation model; the weak supervision instance segmentation model comprises a ViT multi-label classification module and a ViT candidate region scoring module; the ViT multi-label classification module comprises a Vision Transformer network and a candidate region pseudo label generator; the ViT candidate region scoring module comprises a candidate region generator and a ViT candidate region feature generator;
the Vision Transformer network is used for acquiring a multi-label classification result and generating a class activation graph; the candidate region pseudo label generator generates candidate region pseudo labels according to the category activation graph; the candidate region generator generates a COB candidate region by using a convolution-oriented boundary algorithm and a hierarchical segmentation algorithm; the ViT candidate region feature generator generates feature vectors of the COB candidate regions by adopting a SegAlign method, and the feature vectors are mapped into classification scores of the COB candidate regions through a full connection layer;
initializing the weakly supervised instance segmentation model, constructing a loss function, performing iterative training on the labeled natural image dataset, and optimizing the loss function to obtain a trained weakly supervised instance segmentation model;
and inputting the natural image to be segmented into the trained weakly supervised instance segmentation model to obtain an instance segmentation result.
In a preferred embodiment, the labeled natural image dataset is represented as:
$\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^{N}$
where $X_i$ denotes the i-th labeled natural image and $Y_i$ the label of the i-th natural image, $N$ denotes the number of images in the labeled natural image dataset, and $C$ denotes the number of labels;
before iterative training of the weakly supervised instance segmentation model on the labeled natural image dataset, the natural images in the dataset are randomly cropped to a set size, randomly horizontally flipped, and normalized per channel;
initializing the weakly supervised instance segmentation model means pre-training it on a large image dataset and using the pre-trained model parameters as the initialization parameters.
In a preferred embodiment, the loss functions comprise a Focal Loss function and a cross-entropy (CE) loss function;
the Focal Loss function is used to train the ViT multi-label classification module and is expressed as:
$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$
where $y$ is the true label, $\alpha_t$ and $\gamma$ are the balancing and focusing factors of the Focal Loss, and $p_t$ is the prediction probability, defined as:
$p_t = p$ if $y = 1$, and $p_t = 1 - p$ otherwise,
where $p$ is the output value of the Vision Transformer network without any activation function applied, obtained as follows:
a natural image of input size W × H is divided into w × h image blocks, each containing P × P pixels, where w = W/P and h = H/P; the image blocks are input into the Vision Transformer network to output a feature matrix, and the feature matrix is mapped by a convolutional layer and a global average pooling layer into a C-dimensional prediction score vector, which is the output value p of the Vision Transformer network;
the CE loss function is used to train the ViT candidate region scoring module and is expressed as:
$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\,\log(p'_{i,k})$
where $y_{i,k}$ is the true label of the i-th COB candidate region, there are K label values and N COB candidate regions, and $p'_{i,k}$ is the probability that the i-th COB candidate region is predicted as the k-th label value.
In a preferred embodiment, the iterative training on the labeled natural image dataset specifically comprises:
classifying the labeled natural image dataset with the Vision Transformer network to obtain a multi-label classification result and generate a class activation map;
inputting the labeled natural image dataset into the candidate region generator and generating COB candidate regions with the convolution-oriented boundary algorithm and a hierarchical segmentation algorithm;
obtaining candidate region pseudo labels with the candidate region pseudo label generator according to the class activation map and the COB candidate regions;
inputting the COB candidate regions into the ViT candidate region feature generator, generating the feature vectors of the COB candidate regions with the SegAlign method, and mapping the feature vectors through fully connected layers to the classification scores and classes of the COB candidate regions;
calculating the loss value, optimizing the loss function, and training iteratively until convergence to obtain the trained weakly supervised instance segmentation model.
In a preferred embodiment, the Vision Transformer network comprises a convolutional layer, L cascaded Transformer blocks and a global average pooling layer; each Transformer block comprises a linear transformation layer, a multi-head self-attention layer and a multi-layer perceptron (MLP) block;
the obtaining of the multi-label classification result and the generation of the class activation graph specifically include:
inputting a labeled natural image data set into a Vision Transformer network, cutting each natural image with the size of W multiplied by H in the labeled natural image data set into W multiplied by H image blocks, performing convolution operation through a convolution layer to obtain a one-dimensional vector, and obtaining N block labels t; adding category labels to block labels
Figure BDA0003762840710000041
D represents the dimension of each block label;
sending all block marks added with category marks into L cascaded transformer blocks for feature extraction to obtain a feature matrix S of the image c And L attention vectors
Figure BDA0003762840710000042
Feature matrix S of image c Inputting the convolutional layer and the global average pooling layer to obtain a multi-label classification result;
for L attention vectors
Figure BDA0003762840710000043
Calculating an average value, and deforming according to the position of the image block in the natural image to obtain an attention diagram, wherein the formula is as follows:
Figure BDA0003762840710000044
A′ * =Γ w×h (A * )
wherein, gamma is w×h (. Cndot.) is a deformation function;
the attention map and the feature matrix of the image are multiplied element by element to produce a class activation map TS-CAM, denoted as
Figure BDA0003762840710000045
The element multiplication formula is:
Figure BDA0003762840710000046
In an embodiment, obtaining the candidate region pseudo labels with the candidate region pseudo label generator specifically comprises:
in the candidate region pseudo label generator, obtaining the local peaks $P_i^c$ on the class activation map;
using each local peak $P_i^c$ and the COB candidate regions $\{R_n\}_{n=1}^{N}$ to obtain an auxiliary mask $M_i^c$;
sorting the auxiliary masks $M_i^c$ in ascending order of their local peak values $P_i^c$ and computing in turn the overlap $IOU(R_n, M_i^c)$ between a COB candidate region $R_n$ and each auxiliary mask $M_i^c$; if the overlap exceeds a threshold $\lambda$, the pseudo label $z_n$ of that COB candidate region is set to class c, i.e. $z_n = c$; the overlap IOU denotes the ratio of the intersection of the two regions to their union;
if the overlap between a COB candidate region and all auxiliary masks is below the threshold, the COB candidate region is labeled as the background class.
In a preferred embodiment, obtaining the local peaks $P_i^c$ on the class activation map specifically comprises:
taking out the class activation map $M_c$ of a certain class according to the multi-label classification result;
performing a max-pooling operation on the class activation map $M_c$ with a pooling kernel of size m × m, traversing every position of the class activation map with the kernel center, and recording the local maximum and the corresponding position coordinates;
when the local-maximum position coordinates recorded at a pixel of the class activation map are exactly the coordinates of that pixel, the pixel is recorded as a local peak $P_i^c$.
Using each local peak $P_i^c$ and the COB candidate regions $\{R_n\}_{n=1}^{N}$ to obtain the auxiliary mask $M_i^c$ is specifically:
for each local peak $P_i^c$, find all COB candidate regions containing the local peak, average them, and obtain the auxiliary mask $M_i^c$ corresponding to the local peak by thresholding, namely:
$\bar{R}_i^c(p,q) = \frac{1}{|\mathcal{R}_i^c|}\sum_{R_n \in \mathcal{R}_i^c} R_n(p,q)$
$M_i^c(p,q) = \mathbb{1}\big[\bar{R}_i^c(p,q) > \beta\big]$
where $\mathcal{R}_i^c$ denotes the set of COB candidate regions containing the local peak point $P_i^c$ and $|\mathcal{R}_i^c|$ their number, $p \in [0, H]$ and $q \in [0, W]$ are integers representing coordinate indices, and the threshold $\beta \in [0,1]$ is a hyper-parameter.
In an embodiment, the mapping is classification scores and categories of the COB candidate regions, and specifically includes:
in a ViT candidate region feature generator, dividing a category activation graph into n multiplied by n image blocks, and inputting a Vision Transformer network to obtain a feature vector of each image block;
splicing the feature vectors of all image blocks into a feature matrix in sequence, and then reconstructing the spliced feature matrix into a new feature matrix according to the position of each image block in the corresponding natural image;
inputting the new feature matrix into the 1 × 1 convolution, and fusing the features of each channel to obtain a feature layer F;
obtaining the features of each COB candidate area on the feature layer F by using a SegAlign method, aligning the features, and obtaining the alignment features
Figure BDA00037628407100000510
Aligning feature f n Leveling to one dimension, inputting the one dimension into three full-connection layers, and obtaining classification scores of COB candidate regions after Softmax
Figure BDA00037628407100000511
In another aspect, the invention provides a Vision Transformer network-based weakly supervised instance segmentation system, which applies the above Vision Transformer network-based weakly supervised instance segmentation method and comprises a data acquisition module, a model construction module, a model training module and an instance segmentation module;
the data acquisition module is used for acquiring a labeled natural image dataset and a natural image to be segmented;
the model construction module is used for constructing the weakly supervised instance segmentation model; the weakly supervised instance segmentation model comprises a ViT multi-label classification module and a ViT candidate region scoring module; the ViT multi-label classification module comprises a Vision Transformer network and a candidate region pseudo label generator; the ViT candidate region scoring module comprises a candidate region generator and a ViT candidate region feature generator;
the model training module is used for initializing the weakly supervised instance segmentation model, constructing a loss function, performing iterative training on the labeled natural image dataset, and optimizing the loss function to obtain a trained weakly supervised instance segmentation model;
the instance segmentation module is used for inputting the natural image to be segmented into the trained weakly supervised instance segmentation model to obtain an instance segmentation result.
In still another aspect, the invention provides a computer-readable storage medium storing a program which, when executed by a processor, implements the above Vision Transformer network-based weakly supervised instance segmentation method.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. In terms of pseudo label generation, the class activation map generated by the Vision Transformer network can cover a larger extent of an object, the local peaks generated on the class activation map can cover a larger area of the object, and the candidate region pseudo labels constructed from the relation between the local peaks and the candidate regions are more accurate, so the trained ViT candidate region scoring module can score more accurately and a more accurate instance segmentation result is obtained.
2. Thanks to the multi-head attention mechanism, the feature layer of the image generated by the Vision Transformer network attends to the global information of the image; in transfer learning, using the Vision Transformer network for feature extraction greatly reduces the number of parameters to be learned, shortens training and testing time, and saves computing resources.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart of a Vision Transformer-based weakly supervised instance segmentation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating generation of pseudo labels for candidate regions according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a candidate region feature generator according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a SegAlign process in accordance with an embodiment of the present invention;
FIG. 5 is a block diagram of a Vision Transformer-based weakly supervised instance segmentation system according to an embodiment of the present invention;
FIG. 6 is a diagram of a computer storage medium according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Referring to fig. 1, in an embodiment of the present application, a Vision Transformer-based weakly supervised instance segmentation method is provided, including the following steps:
s1, acquiring a labeled natural image data set and a natural image to be segmented;
in this embodiment, X represents a natural image containing objects of multiple classes and Y represents its multi-class label, $Y = [y_1, \ldots, y_C] \in \{0,1\}^{C\times 1}$, where $y_c = 1$ denotes that the image contains an object of class c and $y_c = 0$ that it does not; the acquired labeled natural image dataset is therefore represented as:
$\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^{N}$
where $X_i$ represents the i-th labeled natural image and $Y_i$ the label of the i-th natural image, $N$ represents the number of images in the labeled natural image dataset, and $C$ represents the number of labels.
Before iterative training of the weakly supervised instance segmentation model on the labeled natural image dataset, the natural images in the dataset are randomly cropped to a set size, randomly horizontally flipped, and then normalized per channel. In this embodiment the natural images are cropped to a size of 224 × 224.
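A minimal sketch of such a preprocessing pipeline, assuming PyTorch/torchvision; the normalization mean and standard deviation below are the common ImageNet statistics and are an assumption, not values given in this embodiment:

```python
from torchvision import transforms

# Random 224 x 224 crop, random horizontal flip, then per-channel normalization,
# as described in this embodiment.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                  # randomly crop to the set size
    transforms.RandomHorizontalFlip(p=0.5),             # random horizontal flip
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],    # assumed ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```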
S2, constructing the weakly supervised instance segmentation model, which, as shown in FIG. 1, comprises a ViT multi-label classification module and a ViT candidate region scoring module; the ViT multi-label classification module comprises a Vision Transformer network and a candidate region pseudo label generator; the ViT candidate region scoring module comprises a candidate region generator and a ViT candidate region feature generator;
the Vision Transformer network is used to obtain the multi-label classification result and generate the class activation map; the candidate region pseudo label generator generates candidate region pseudo labels according to the class activation map; the candidate region generator generates the COB candidate regions with the convolution-oriented boundary algorithm and a hierarchical segmentation algorithm; the ViT candidate region feature generator generates the feature vectors of the COB candidate regions with the SegAlign method, and the feature vectors are mapped through fully connected layers into the classification scores of the COB candidate regions;
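The overall two-branch structure can be sketched roughly as follows (assuming PyTorch; the class and attribute names are hypothetical placeholders, and the COB candidate region generator is treated as an external component):

```python
import torch.nn as nn

class WeaklySupervisedInstSeg(nn.Module):
    """Skeleton of the weakly supervised instance segmentation model described above."""
    def __init__(self, vit_classifier, cob_generator, proposal_scorer):
        super().__init__()
        self.vit_classifier = vit_classifier    # ViT multi-label classification module
        self.cob_generator = cob_generator      # COB candidate region generator
        self.proposal_scorer = proposal_scorer  # ViT candidate region scoring module

    def forward(self, image):
        logits, cam = self.vit_classifier(image)       # multi-label scores + class activation map
        proposals = self.cob_generator(image)          # COB candidate regions (binary masks)
        scores = self.proposal_scorer(cam, proposals)  # per-region classification scores
        return logits, cam, proposals, scores
```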
s3, initializing a weak supervision instance segmentation model, constructing a loss function, performing iterative training on a labeled natural image data set, and optimizing the loss function to obtain a trained weak supervision instance segmentation model;
s31, initializing the weak supervision example segmentation model refers to pre-training the weak supervision example segmentation model on the large image data set, and using model parameters after pre-training as initialization parameters. In the embodiment, the weak supervision example segmentation model is pre-trained on the ImageNet21k data set, and the parameters after pre-training are used as initialization parameters; the ImageNet21k data set is an image database organized according to a WordNet hierarchical structure, and has the advantages of more pictures, high resolution, more types and more irrelevant noises and changes, so the recognition difficulty is high, and the data set is commonly used for the evaluation of classification, positioning and detection tasks and can prevent the over-fitting phenomenon.
S32, since the weakly supervised instance segmentation model constructed by the method comprises two module branches, the two branches need to be trained separately. For the ViT multi-label classification module, in order to address the class imbalance problem, this branch is trained with the Focal Loss function, expressed as:
$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$
where $y$ is the true label, $\alpha_t$ and $\gamma$ are the balancing and focusing factors of the Focal Loss, and $p_t$ is the prediction probability, defined as:
$p_t = p$ if $y = 1$, and $p_t = 1 - p$ otherwise,
where $p$ is the output value of the Vision Transformer network without any activation function applied, obtained as follows:
a natural image of input size W × H is divided into w × h image blocks, each containing P × P pixels, where w = W/P and h = H/P; the image blocks are input into the Vision Transformer network to output a feature matrix, and the feature matrix is mapped by a convolutional layer and a global average pooling layer into a C-dimensional prediction score vector, which is p. In this embodiment, each natural image of size 224 × 224 is divided into 14 × 14 image blocks, each of 16 × 16 pixels, and a feature matrix of size 768 × 14 × 14 is obtained.
For the ViT candidate region scoring module, training uses the CE (cross-entropy) loss function, expressed as:
$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\,\log(p'_{i,k})$
where $y_{i,k}$ is the true label of the i-th COB candidate region, there are K label values and N COB candidate regions, and $p'_{i,k}$ is the probability that the i-th COB candidate region is predicted as the k-th label value; fitting the CE loss also increases the inter-class distance to a certain extent.
S33, in the process of iterative training on the labeled natural image dataset, the two module branches are trained without freezing any weights; the training process is as follows:
S331, classifying the labeled natural image dataset with the Vision Transformer network to obtain the multi-label classification result and generate the class activation map;
specifically, the Vision Transformer network comprises a convolutional layer, L cascaded Transformer blocks and a global average pooling layer; each Transformer block comprises a linear transformation layer, a multi-head self-attention layer and a multi-layer perceptron (MLP) block;
the labeled natural image dataset is input into the Vision Transformer network; each natural image in the dataset is cut into w × h image blocks, which are passed through the convolutional layer to obtain one-dimensional vectors, yielding N = w × h block tokens t; a class token is then added to the block tokens, giving $t \in \mathbb{R}^{(N+1)\times D}$, where D denotes the dimension of each block token;
all block tokens with the added class token are fed into the L cascaded Transformer blocks for feature extraction, yielding the feature matrix $S_c$ of the image and L attention vectors $\{A_l^{*}\}_{l=1}^{L}$.
Denote $t^{l-1}$ as the input of the l-th Transformer block. In the attention operation of the l-th Transformer block, the output block tokens are computed as:
$A_l = \mathrm{softmax}\!\left(\frac{(t^{l-1} W_Q)\,(t^{l-1} W_K)^{\top}}{\sqrt{D}}\right),\qquad t^{l} = A_l\,(t^{l-1} W_V)$
where the parameter matrices $W_Q$, $W_K$ and $W_V$ represent the linear transformation layers applied before the attention operation of the l-th Transformer block; the matrix $A_l$ is the attention matrix, whose first row is the attention vector of the class token, $A_l^{*}$; the attention vector $A_l^{*}$ records the dependency of the class token on the tokens of the other image blocks, and under the action of the loss function the attention vector $A_l^{*}$ is driven toward the object regions that are useful for the classification task.
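As an illustration, the class-token attention vectors can be read off the attention matrices roughly as follows (a sketch assuming PyTorch tensors; the head-averaging step is an assumption, since the text does not state how the multiple heads are combined):

```python
import torch

def class_token_attention(attn_matrices):
    """attn_matrices: (L, heads, N + 1, N + 1) attention matrices A_l of the L blocks.
    Returns (L, N) attention of the class token (row 0) over the N image-block tokens."""
    attn = attn_matrices.mean(dim=1)       # average the heads (assumption)
    return attn[:, 0, 1:]                  # first row, excluding the class token itself
```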
The feature matrix $S_c$ of the image is input into the convolutional layer and the global average pooling layer to obtain the multi-label classification result;
the L attention vectors $\{A_l^{*}\}_{l=1}^{L}$ are averaged and reshaped according to the positions of the image blocks in the natural image to obtain the attention map, with the formulas:
$A_{*} = \frac{1}{L}\sum_{l=1}^{L} A_l^{*}$
$A'_{*} = \Gamma_{w\times h}(A_{*})$
where $\Gamma_{w\times h}(\cdot)$ is the reshaping function;
the attention map is multiplied element-wise with the feature matrix of the image to produce the class activation map TS-CAM, denoted $M$; the element-wise multiplication is:
$M = A'_{*} \odot S_c$
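A minimal sketch of this class activation map computation (assuming PyTorch; tensor shapes are assumptions consistent with the 14 × 14 patch grid used in this embodiment):

```python
import torch

def ts_cam(attn_vectors, class_feature_maps, w=14, h=14):
    """attn_vectors: (L, N) class-token attention over the N = w*h image blocks;
    class_feature_maps: (C, w, h) per-class feature maps derived from S_c.
    Returns the (C, w, h) class activation map M."""
    a = attn_vectors.mean(dim=0)            # average over the L Transformer blocks
    a = a.reshape(1, w, h)                  # Gamma_{w x h}: reshape to the patch grid
    return a * class_feature_maps           # element-wise product -> TS-CAM
```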
s332, inputting the labeled natural image data set into a candidate region generator, and generating a COB candidate region by using a convolution-oriented boundary algorithm and a hierarchical segmentation algorithm;
a Convolutional guided boundary (COB) is based on MCG, a deep Convolutional neural network is used to obtain edge information of an image, the algorithm realizes an end-to-end learnable Convolutional neural network to detect the edge and the corresponding direction of the image, and only one-time forward propagation of the image level is needed to generate multi-scale contour information and estimate the direction of the edge; the COB candidate regions are generated by using the MCG hierarchical segmentation algorithm using these multi-scale contour information and edge directions.
S333, obtaining the candidate region pseudo labels with the candidate region pseudo label generator according to the class activation map and the COB candidate regions;
as shown in FIG. 2, in the candidate region pseudo label generator the local peaks $P_i^c$ on the class activation map are obtained as follows:
first, the class activation map $M_c$ of a certain class is taken out according to the multi-label classification result; a max-pooling operation is then performed on the class activation map $M_c$ with a pooling kernel of size m × m; the kernel center traverses every position of the class activation map and the local maximum and its position coordinates are recorded; when the local-maximum position coordinates recorded at a pixel of the class activation map are exactly the coordinates of that pixel, the pixel is recorded as a local peak $P_i^c$ of the class activation map.
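A sketch of this peak extraction using a single max-pooling pass (assuming PyTorch; the kernel size m and the optional minimum activation are assumptions):

```python
import torch
import torch.nn.functional as F

def local_peaks(cam, m=3, min_val=0.0):
    """cam: (H, W) class activation map M_c of one class; returns peak coordinates (y, x)."""
    pooled = F.max_pool2d(cam[None, None], kernel_size=m, stride=1, padding=m // 2)[0, 0]
    is_peak = (cam == pooled) & (cam > min_val)   # a pixel equal to its local maximum is a peak
    ys, xs = torch.nonzero(is_peak, as_tuple=True)
    return list(zip(ys.tolist(), xs.tolist()))
```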
Using each local peak $P_i^c$ and the COB candidate regions $\{R_n\}_{n=1}^{N}$, the auxiliary mask $M_i^c$ is obtained as follows:
for each local peak $P_i^c$, all COB candidate regions containing the local peak are found and averaged, and the auxiliary mask $M_i^c$ corresponding to the local peak is obtained by thresholding, namely:
$\bar{R}_i^c(p,q) = \frac{1}{|\mathcal{R}_i^c|}\sum_{R_n \in \mathcal{R}_i^c} R_n(p,q)$
$M_i^c(p,q) = \mathbb{1}\big[\bar{R}_i^c(p,q) > \beta\big]$
where $\mathcal{R}_i^c$ denotes the set of COB candidate regions containing the local peak point $P_i^c$ and $|\mathcal{R}_i^c|$ their number, $p \in [0, H]$ and $q \in [0, W]$ are integers representing coordinate indices, H and W represent the height and width of the natural image respectively, and the threshold $\beta \in [0,1]$ is a hyper-parameter.
The auxiliary masks $M_i^c$ are sorted in ascending order of their local peak values $P_i^c$, and the overlap $IOU(R_n, M_i^c)$ between a COB candidate region $R_n$ and each auxiliary mask $M_i^c$ is computed in turn; if the overlap exceeds a threshold $\lambda$, the pseudo label $z_n$ of that COB candidate region is set to class c, i.e. $z_n = c$; the overlap IOU denotes the ratio of the intersection of the two regions to their union; in this embodiment the threshold is $\lambda = 0.5$.
If the overlap between a COB candidate region and all auxiliary masks is below the threshold, the COB candidate region is labeled as the background class.
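A sketch of the auxiliary-mask construction and pseudo-label assignment just described (assuming PyTorch and binary mask tensors; the tensor shapes, ordering convention and default thresholds are assumptions where not stated above):

```python
import torch

def auxiliary_mask(peak, proposals, beta=0.5):
    """peak: (y, x) local peak; proposals: (N, H, W) binary COB candidate masks."""
    containing = proposals[proposals[:, peak[0], peak[1]] > 0]   # regions containing the peak
    if len(containing) == 0:
        return None
    return (containing.float().mean(dim=0) > beta).float()       # average, then threshold at beta

def mask_iou(a, b):
    inter = (a * b).sum()
    union = a.sum() + b.sum() - inter
    return (inter / union.clamp_min(1e-8)).item()

def assign_pseudo_labels(proposals, aux_masks, background=0, lam=0.5):
    """aux_masks: list of (mask, class_id) pairs ordered by their local peak values.
    Each COB candidate region takes the class of the first auxiliary mask it overlaps
    with IoU above lam; otherwise it is labelled as background."""
    labels = torch.full((len(proposals),), background, dtype=torch.long)
    for n in range(len(proposals)):
        for mask, cls in aux_masks:
            if mask_iou(proposals[n].float(), mask) > lam:
                labels[n] = cls
                break
    return labels
```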
S334, inputting the COB candidate regions into the ViT candidate region feature generator, generating the feature vectors of the COB candidate regions with the SegAlign method, and mapping the feature vectors through fully connected layers to the classification scores and classes of the COB candidate regions;
specifically, as shown in FIG. 3, in the ViT candidate region feature generator the class activation map is divided into n × n image blocks, which are input into the Vision Transformer network to obtain the feature vector (token embedding) of each image block; in this embodiment the class activation map is divided into 14 × 14 image blocks and the obtained feature vectors are 768-dimensional.
The token embeddings of all image blocks are concatenated in order into a 196 × 768 feature matrix, and the concatenated feature matrix, i.e. the matrix containing the n × n (14 × 14) feature vectors, is then reshaped into a new feature matrix according to the position of each image block in the corresponding natural image;
the new feature matrix is input into a 1 × 1 convolution, which fuses the features of the channels to obtain the feature layer F (feature maps); the role of the 1 × 1 convolution is to fuse the features of the channels, changing only the number of channels of the feature map without changing its width and height;
the features of each COB candidate region are obtained and aligned on the feature layer F with the SegAlign method, giving the aligned feature $f_n$; the aligned feature $f_n$ is flattened to one dimension and input into three fully connected layers, and the classification score $s_n$ of each COB candidate region is obtained after Softmax; in this embodiment the fully connected layers have 4096, 4096 and C (the number of labels) nodes.
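A sketch of this scoring head (assuming PyTorch; the class name is hypothetical, `seg_align` stands for the region-alignment operator sketched after the SegAlign description below, and the 7 × 7 aligned feature size is an assumption):

```python
import torch
import torch.nn as nn

class ProposalScoringHead(nn.Module):
    """Rebuilds the spatial token grid, fuses channels with a 1 x 1 conv, pools each COB
    candidate region with SegAlign, and maps it to classification scores."""
    def __init__(self, num_label_values, embed_dim=768, align_size=7):
        super().__init__()
        self.fuse = nn.Conv2d(embed_dim, embed_dim, kernel_size=1)   # 1 x 1 channel fusion
        self.fc = nn.Sequential(
            nn.Linear(embed_dim * align_size * align_size, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_label_values),
        )
        self.align_size = align_size

    def forward(self, patch_tokens, proposals, seg_align):
        """patch_tokens: (196, 768) token embeddings of one image; proposals: (N, H, W) masks."""
        n_tok, d = patch_tokens.shape
        grid = int(n_tok ** 0.5)                                      # 14 for 196 tokens
        feat = patch_tokens.t().reshape(1, d, grid, grid)             # rebuild the spatial layout
        feat = self.fuse(feat)[0]                                     # feature layer F: (d, 14, 14)
        aligned = torch.stack([seg_align(feat, r, self.align_size) for r in proposals])
        return torch.softmax(self.fc(aligned.flatten(1)), dim=1)      # (N, num_label_values)
```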
SegAlign is an improved version of RoIAlign that can be applied to COB candidate regions. As shown in FIG. 4, SegAlign outputs the aligned feature $f_n$ corresponding to a candidate region from the input feature layer F and the candidate region R corresponding to the image, specifically:
for a candidate region R, its receptive field on the feature map F is $R_F$, and the circumscribed rectangle of the receptive field $R_F$ corresponding to the candidate region R is B; if the function $\varphi$ is a bilinear transformation from the spatial coordinates $(i,j) \in f_n$ to $(i',j') \in B$, and $F(\cdot,\cdot)$ is the bilinear interpolation function on the feature map F, the SegAlign operator can be expressed as:
$f_n(i,j) = F\big(\varphi(i,j)\big)\cdot R_F\big(\varphi(i,j)\big)$
(for simplicity of presentation, the channel dimension of the feature layer is not written out in this formula).
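A minimal sketch of a SegAlign-style operator consistent with this description (assuming PyTorch); this is an interpretation, not necessarily the authors' exact implementation: the feature layer is sampled inside the circumscribed rectangle of the region's receptive field and masked by the region so that only features covered by the segment are kept.

```python
import torch
import torch.nn.functional as F

def seg_align(feature, region_mask, out_size=7):
    """feature: (D, Hf, Wf) feature layer F; region_mask: (H, W) binary mask of one COB
    candidate region in image coordinates; returns a (D, out_size, out_size) aligned feature."""
    # Receptive field R_F of the region on F: resample the mask to the feature resolution.
    mask_f = F.interpolate(region_mask[None, None].float(), size=feature.shape[1:],
                           mode="nearest")[0, 0]
    ys, xs = torch.nonzero(mask_f, as_tuple=True)
    if len(ys) == 0:
        return feature.new_zeros(feature.shape[0], out_size, out_size)
    y0, y1 = int(ys.min()), int(ys.max()) + 1            # circumscribed rectangle B
    x0, x1 = int(xs.min()), int(xs.max()) + 1
    masked = feature[:, y0:y1, x0:x1] * mask_f[y0:y1, x0:x1]
    # Bilinear resampling of the masked crop onto the out_size x out_size output grid.
    return F.interpolate(masked[None], size=(out_size, out_size),
                         mode="bilinear", align_corners=False)[0]
```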
S335, calculating the loss value, optimizing the loss function, and training iteratively until convergence to obtain the trained weakly supervised instance segmentation model.
S4, inputting the natural image to be segmented into the trained weakly supervised instance segmentation model to obtain the instance segmentation result.
The natural image is input into the COB candidate region generator to generate the candidate regions $\{R_n\}_{n=1}^{N}$, which are then input into the ViT candidate region scoring module to obtain the class and score of each candidate region; the final instance segmentation result is obtained through non-maximum suppression (NMS).
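A sketch of this inference procedure (assuming PyTorch and mask-level proposals; `cob_proposals` and `score_proposals` stand in for the COB candidate region generator and the trained ViT candidate region scoring module, class index 0 is assumed to be background, and the thresholds are assumptions):

```python
import torch

def mask_iou(a, b):
    inter = (a & b).sum().float()
    union = (a | b).sum().float()
    return (inter / union.clamp_min(1)).item()

def segment_instances(image, cob_proposals, score_proposals, score_thresh=0.5, iou_thresh=0.5):
    masks = cob_proposals(image)                   # (N, H, W) boolean COB candidate regions
    probs = score_proposals(image, masks)          # (N, K) class scores, index 0 = background
    scores, labels = probs[:, 1:].max(dim=1)       # best foreground class per candidate region
    keep = scores > score_thresh
    masks, scores, labels = masks[keep], scores[keep], labels[keep] + 1

    order = scores.argsort(descending=True)        # greedy mask-level non-maximum suppression
    selected = []
    for i in order.tolist():
        if all(mask_iou(masks[i], masks[j]) < iou_thresh for j in selected):
            selected.append(i)
    return [(masks[i], int(labels[i]), float(scores[i])) for i in selected]
```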
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention.
Based on the same idea as the Vision Transformer-based weakly supervised instance segmentation method in the embodiment, the invention further provides a Vision Transformer-based weakly supervised instance segmentation system, which can be used for executing the Vision Transformer-based weakly supervised instance segmentation method. For convenience of explanation, in the structural schematic diagram of the embodiment of the Vision Transformer-based weak supervision example segmentation system, only the part related to the embodiment of the present invention is shown, and those skilled in the art will understand that the illustrated structure does not constitute a limitation to the apparatus, and may include more or less components than those illustrated, or combine some components, or arrange different components.
Referring to fig. 5, in another embodiment of the present application, a Vision Transformer-based weakly supervised instance partitioning system is provided, which includes a data acquisition module, a model construction module, a model training module, and an instance partitioning module;
the data acquisition module is used for acquiring a labeled natural image dataset and a natural image to be segmented;
the model construction module is used for constructing the weakly supervised instance segmentation model, which comprises a ViT multi-label classification module and a ViT candidate region scoring module; the ViT multi-label classification module comprises a Vision Transformer network and a candidate region pseudo label generator; the ViT candidate region scoring module comprises a candidate region generator and a ViT candidate region feature generator;
the model training module is used for initializing the weakly supervised instance segmentation model, constructing a loss function, performing iterative training on the labeled natural image dataset, and optimizing the loss function to obtain a trained weakly supervised instance segmentation model;
the instance segmentation module is used for inputting the natural image to be segmented into the trained weakly supervised instance segmentation model to obtain an instance segmentation result.
It should be noted that the Vision Transformer-based weakly supervised instance segmentation system of the present invention corresponds one-to-one with the Vision Transformer-based weakly supervised instance segmentation method of the present invention; the technical features and beneficial effects described in the embodiment of the method are all applicable to the embodiment of the system, and the specific contents may refer to the description in the method embodiment and are not repeated here.
In addition, in the implementation of the Vision Transformer-based weakly supervised instance partitioning system according to the above embodiment, the logical partitioning of each program module is only an example, and in practical applications, the above function allocation may be performed by different program modules according to needs, for example, due to the configuration requirements of corresponding hardware or the convenience of software implementation, that is, the internal structure of the Vision Transformer-based weakly supervised instance partitioning system is partitioned into different program modules, so as to perform all or part of the above described functions.
Referring to fig. 6, in an embodiment, a computer-readable storage medium is provided, which stores a program in a memory, and when the program is executed by a processor, the method for partitioning a weakly supervised instance based on a Vision Transformer may be implemented, where the method includes:
acquiring a labeled natural image dataset and a natural image to be segmented;
constructing a weakly supervised instance segmentation model, which comprises a ViT multi-label classification module and a ViT candidate region scoring module; the ViT multi-label classification module comprises a Vision Transformer network and a candidate region pseudo label generator; the ViT candidate region scoring module comprises a candidate region generator and a ViT candidate region feature generator;
the Vision Transformer network is used to obtain the multi-label classification result and generate the class activation map; the candidate region pseudo label generator generates candidate region pseudo labels according to the class activation map; the candidate region generator generates the COB candidate regions with the convolution-oriented boundary algorithm and a hierarchical segmentation algorithm; the ViT candidate region feature generator generates the feature vectors of the COB candidate regions with the SegAlign method, and the feature vectors are mapped through fully connected layers into the classification scores of the COB candidate regions;
initializing the weakly supervised instance segmentation model, constructing a loss function, performing iterative training on the labeled natural image dataset, and optimizing the loss function to obtain a trained weakly supervised instance segmentation model;
and inputting the natural image to be segmented into the trained weakly supervised instance segmentation model to obtain an instance segmentation result.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program, which may be stored in a non-volatile computer readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), rambus (Rambus) direct RAM (RDRAM), direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such modifications are intended to be included in the scope of the present invention.

Claims (10)

1. A Vision Transformer network-based weakly supervised instance segmentation method, characterized by comprising the following steps:
acquiring a labeled natural image dataset and a natural image to be segmented;
constructing a weakly supervised instance segmentation model; the weakly supervised instance segmentation model comprises a ViT multi-label classification module and a ViT candidate region scoring module; the ViT multi-label classification module comprises a Vision Transformer network and a candidate region pseudo label generator; the ViT candidate region scoring module comprises a candidate region generator and a ViT candidate region feature generator;
the Vision Transformer network is used to obtain a multi-label classification result and generate a class activation map; the candidate region pseudo label generator generates candidate region pseudo labels according to the class activation map; the candidate region generator generates COB candidate regions with a convolution-oriented boundary algorithm and a hierarchical segmentation algorithm; the ViT candidate region feature generator generates the feature vectors of the COB candidate regions with the SegAlign method, and the feature vectors are mapped through fully connected layers into the classification scores of the COB candidate regions;
initializing the weakly supervised instance segmentation model, constructing a loss function, performing iterative training on the labeled natural image dataset, and optimizing the loss function to obtain a trained weakly supervised instance segmentation model;
and inputting the natural image to be segmented into the trained weakly supervised instance segmentation model to obtain an instance segmentation result.
2. The Vision Transformer network-based weakly supervised instance segmentation method of claim 1, wherein the labeled natural image dataset is represented as:
$\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^{N}$
where $X_i$ represents the i-th labeled natural image and $Y_i$ the label of the i-th natural image, $N$ represents the number of images in the labeled natural image dataset, and $C$ represents the number of labels;
before iterative training of the weakly supervised instance segmentation model on the labeled natural image dataset, the natural images in the dataset are randomly cropped to a set size, randomly horizontally flipped, and normalized per channel;
initializing the weakly supervised instance segmentation model means pre-training it on a large image dataset and using the pre-trained model parameters as the initialization parameters.
3. The Vision Transformer network-based weakly supervised instance segmentation method of claim 2, wherein the loss functions comprise a Focal Loss function and a cross-entropy (CE) loss function;
the Focal Loss function is used to train the ViT multi-label classification module and is expressed as:
$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$
where $y$ is the true label, $\alpha_t$ and $\gamma$ are the balancing and focusing factors of the Focal Loss, and $p_t$ is the prediction probability, defined as:
$p_t = p$ if $y = 1$, and $p_t = 1 - p$ otherwise,
where $p$ is the output value of the Vision Transformer network without any activation function applied, obtained as follows:
a natural image of input size W × H is divided into w × h image blocks, each containing P × P pixels, where w = W/P and h = H/P; the image blocks are input into the Vision Transformer network to output a feature matrix, and the feature matrix is mapped by a convolutional layer and a global average pooling layer into a C-dimensional prediction score vector, which is the output value p of the Vision Transformer network;
the CE loss function is used to train the ViT candidate region scoring module and is expressed as:
$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\,\log(p'_{i,k})$
where $y_{i,k}$ is the true label of the i-th COB candidate region, there are K label values and N COB candidate regions, and $p'_{i,k}$ is the probability that the i-th COB candidate region is predicted as the k-th label value.
4. The Vision Transformer network-based weakly supervised instance segmentation method of claim 3, wherein the iterative training on the labeled natural image dataset specifically comprises:
classifying the labeled natural image dataset with the Vision Transformer network to obtain a multi-label classification result and generate a class activation map;
inputting the labeled natural image dataset into the candidate region generator and generating COB candidate regions with the convolution-oriented boundary algorithm and a hierarchical segmentation algorithm;
obtaining candidate region pseudo labels with the candidate region pseudo label generator according to the class activation map and the COB candidate regions;
inputting the COB candidate regions into the ViT candidate region feature generator, generating the feature vectors of the COB candidate regions with the SegAlign method, and mapping the feature vectors through fully connected layers to the classification scores and classes of the COB candidate regions;
calculating the loss value, optimizing the loss function, and training iteratively until convergence to obtain the trained weakly supervised instance segmentation model.
5. The Vision Transformer network based weakly supervised instance partitioning method of claim 4, wherein the Vision Transformer network comprises convolution layers, L cascaded Transformer blocks and a global average pooling layer; the transformer blocks comprise a linear transformation layer, a multi-head self-attention layer and a multi-layer sensing block;
the obtaining of the multi-label classification result and the generation of the class activation map specifically include:
inputting the labeled natural image data set into the Vision Transformer network, cutting each natural image of size W × H in the labeled natural image data set into w × h image blocks, performing a convolution operation through a convolution layer to obtain one-dimensional vectors, thereby obtaining N patch tokens t; a class token is then appended to the patch tokens, giving a token sequence of dimension (N + 1) × D, wherein D denotes the dimension of each token;
sending all patch tokens together with the appended class token into the L cascaded Transformer blocks for feature extraction, to obtain a feature matrix S_c of the image and L attention vectors A_1, …, A_L;
inputting the feature matrix S_c of the image into the convolution layer and the global average pooling layer to obtain the multi-label classification result;
averaging the L attention vectors A_1, …, A_L and reshaping the result according to the positions of the image blocks in the natural image to obtain the attention map, with the formulas:

A_* = (1/L) · Σ_l A_l

A′_* = Γ_{w×h}(A_*)

wherein Γ_{w×h}(·) is a reshaping function;
the attention map and the feature matrix of the image are multiplied element by element to produce the class activation map TS-CAM, denoted M_c; the element-wise multiplication formula is:

M_c = A′_* ⊙ S_c

wherein ⊙ denotes element-wise (Hadamard) multiplication.
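As an illustration of the attention averaging, reshaping and element-wise coupling described above, the following sketch assumes the attention vectors are stacked as an (L, N) tensor and the feature matrix is laid out as (C, w, h); these layouts are assumptions made for the sketch.

```python
# TS-CAM-style coupling sketch; tensor layouts are assumptions made for illustration.
import torch

def ts_cam(attn_vectors: torch.Tensor, feature_map: torch.Tensor, w: int, h: int) -> torch.Tensor:
    # attn_vectors: (L, N) class-token attention from the L Transformer blocks (N = w * h).
    # feature_map:  (C, w, h) feature matrix S_c after the convolution layer.
    attn_avg = attn_vectors.mean(dim=0)          # A_*  : average over the L layers
    attn_map = attn_avg.reshape(w, h)            # A'_* : reshape to the patch grid
    return feature_map * attn_map.unsqueeze(0)   # M_c = A'_* (element-wise) S_c, broadcast over channels
```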
6. The Vision Transformer network based weak supervision instance segmentation method according to claim 5, wherein obtaining the candidate region pseudo labels by using the candidate region pseudo label generator is specifically:
obtaining local peak values on the class activation map in the candidate region pseudo label generator;
using each local peak value and the COB candidate regions to obtain an auxiliary mask;
arranging the auxiliary masks in ascending order according to their local peak values, and sequentially calculating the degree of overlap between a given COB candidate region R_n and each auxiliary mask;
if the degree of overlap exceeds a threshold λ, the pseudo label z_n of the COB candidate region is set to the corresponding class c, i.e. z_n = c; the degree of overlap IOU denotes the ratio of the intersection of the two regions to their union;
if the degree of overlap between a certain COB candidate region and all auxiliary masks is lower than the threshold, the COB candidate region is marked as a background class.
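A sketch of the IoU-based pseudo-label assignment described above follows; the binary-mask representation, the ordering of the auxiliary masks and the threshold value λ are assumptions for illustration.

```python
# IoU-based pseudo-label assignment sketch; mask layout and threshold value are assumptions.
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum()) / union if union > 0 else 0.0

def assign_pseudo_labels(regions: np.ndarray, aux_masks, lam: float = 0.5, background: int = 0):
    # regions: (N, H, W) boolean COB candidate masks; aux_masks: list of (class_id, (H, W) bool mask),
    # assumed already sorted as described in the claim.
    labels = np.full(len(regions), background, dtype=np.int64)   # default: background class
    for n, region in enumerate(regions):
        for class_id, mask in aux_masks:
            if iou(region, mask) > lam:                          # overlap exceeds threshold lambda
                labels[n] = class_id                             # z_n = c
                break
    return labels
```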
7. The Vision Transformer network based weakly supervised instance segmentation method of claim 6, wherein obtaining the local peak values on the class activation map specifically comprises the following steps:
taking out the class activation map M_c of a certain class according to the multi-label classification result;
performing a max pooling operation on the class activation map M_c with a pooling kernel of size m × m; the center of the pooling kernel traverses each position of the class activation map, and the local maximum value and its corresponding position coordinates are recorded;
when the local maximum position coordinates recorded at a certain pixel of the class activation map are exactly the position coordinates of that pixel, the pixel is recorded as a local peak value;
the using of each local peak value and the COB candidate regions to obtain an auxiliary mask operates as follows:
for each local peak value, finding all COB candidate regions containing the local peak value, averaging them, and obtaining the auxiliary mask corresponding to the local peak value through threshold selection, namely:

AvgMask_i(p, q) = (1/N_i) · Σ_{n: R_n contains the i-th local peak} R_n(p, q)

AuxMask_i(p, q) = 1 if AvgMask_i(p, q) > β, and AuxMask_i(p, q) = 0 otherwise,

wherein N_i denotes the number of COB candidate regions containing the i-th local peak point, p ∈ [0, H] and q ∈ [0, W] are integers denoting coordinate indices, and the threshold β ∈ [0, 1] is a hyper-parameter.
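The local-peak extraction and auxiliary-mask construction above can be sketched as follows; the kernel size m, the threshold β and the tensor layouts are assumptions made for the sketch.

```python
# Local-peak extraction and auxiliary-mask construction sketch; m, beta and layouts are assumptions.
import torch
import torch.nn.functional as F

def local_peaks(cam: torch.Tensor, m: int = 3):
    # cam: (H, W) class activation map M_c; a pixel is a local peak if it equals the m x m local maximum.
    pooled = F.max_pool2d(cam[None, None], kernel_size=m, stride=1, padding=m // 2)[0, 0]
    ys, xs = torch.nonzero(cam == pooled, as_tuple=True)
    return list(zip(ys.tolist(), xs.tolist()))

def auxiliary_mask(peak, regions: torch.Tensor, beta: float = 0.5) -> torch.Tensor:
    # regions: (N, H, W) binary COB candidate masks; average those containing the peak, then threshold.
    y, x = peak
    containing = regions[regions[:, y, x] > 0]
    if containing.numel() == 0:
        return torch.zeros_like(regions[0], dtype=torch.float32)
    return (containing.float().mean(dim=0) > beta).float()
```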
8. The Vision Transformer network based weakly supervised instance segmentation method of claim 6, wherein the mapping into classification scores and categories of the COB candidate regions is specifically:
in the ViT candidate region feature generator, dividing the class activation map into n × n image blocks and inputting them into the Vision Transformer network to obtain a feature vector for each image block;
concatenating the feature vectors of all image blocks into a feature matrix in sequence, and then rearranging the concatenated feature matrix into a new feature matrix according to the position of each image block in the corresponding natural image;
inputting the new feature matrix into a 1 × 1 convolution, and fusing the features of each channel to obtain a feature layer F;
obtaining the features of each COB candidate region on the feature layer F by using a SegAlign method and aligning them to obtain the aligned feature f_n;
flattening the aligned feature f_n to one dimension, inputting it into three fully connected layers, and obtaining the classification score of the COB candidate region after Softmax.
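Finally, a sketch of the candidate-region scoring head; since the SegAlign details are not reproduced here, the sketch approximates the alignment step with masked average pooling over the feature layer F, and the layer widths and number of classes are assumptions rather than values from the claims.

```python
# Region scoring head sketch; masked average pooling stands in for SegAlign, and all sizes are assumed.
import torch
import torch.nn as nn

class RegionScorer(nn.Module):
    def __init__(self, channels: int = 256, num_classes: int = 21):
        super().__init__()
        self.fcs = nn.Sequential(                      # three fully connected layers
            nn.Linear(channels, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, feature_layer: torch.Tensor, region_masks: torch.Tensor) -> torch.Tensor:
        # feature_layer: (C, H, W) feature layer F; region_masks: (N, H, W) binary COB candidate masks.
        masks = region_masks.float()
        area = masks.sum(dim=(1, 2)).clamp(min=1.0)
        feats = torch.einsum('chw,nhw->nc', feature_layer, masks) / area[:, None]  # per-region features
        return torch.softmax(self.fcs(feats), dim=1)   # classification scores after Softmax
```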
9. The Vision Transformer network-based weak supervision instance segmentation system is characterized by being applied to the Vision Transformer network-based weak supervision instance segmentation method of any one of claims 1-8, and comprising a data acquisition module, a model construction module, a model training module and an instance segmentation module;
the data acquisition module is used for acquiring a labeled natural image data set and a natural image to be segmented;
the model construction module is used for constructing the weak supervision instance segmentation model; the weak supervision instance segmentation model comprises a ViT multi-label classification module and a ViT candidate region scoring module; the ViT multi-label classification module comprises a Vision Transformer network and a candidate region pseudo label generator; the ViT candidate region scoring module comprises a candidate region generator and a ViT candidate region feature generator;
the model training module is used for initializing a weak supervision instance segmentation model, constructing a loss function, performing iterative training on a labeled natural image data set, and optimizing the loss function to obtain a trained weak supervision instance segmentation model;
the example segmentation module is used for inputting the natural image to be segmented into the trained weak supervision example segmentation model to obtain an example segmentation result.
10. A computer-readable storage medium storing a program, wherein the program, when executed by a processor, implements the Vision Transformer network-based weakly supervised instance segmentation method of any of claims 1 to 8.
CN202210877230.0A 2022-07-25 2022-07-25 Vision transform network-based weak supervision instance segmentation method, system and medium Pending CN115359254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210877230.0A CN115359254A (en) 2022-07-25 2022-07-25 Vision transform network-based weak supervision instance segmentation method, system and medium

Publications (1)

Publication Number Publication Date
CN115359254A true CN115359254A (en) 2022-11-18

Family

ID=84031992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210877230.0A Pending CN115359254A (en) 2022-07-25 2022-07-25 Vision transform network-based weak supervision instance segmentation method, system and medium

Country Status (1)

Country Link
CN (1) CN115359254A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116403015A (en) * 2023-03-13 2023-07-07 武汉大学 Unsupervised target re-identification method and system based on perception-aided learning transducer model
CN116403015B (en) * 2023-03-13 2024-05-03 武汉大学 Unsupervised target re-identification method and system based on perception-aided learning transducer model
CN116385455B (en) * 2023-05-22 2024-01-26 北京科技大学 Flotation foam image example segmentation method and device based on gradient field label
CN116385455A (en) * 2023-05-22 2023-07-04 北京科技大学 Flotation foam image example segmentation method and device based on gradient field label
CN116342627B (en) * 2023-05-23 2023-09-08 山东大学 Intestinal epithelial metaplasia area image segmentation system based on multi-instance learning
CN116342627A (en) * 2023-05-23 2023-06-27 山东大学 Intestinal epithelial metaplasia area image segmentation system based on multi-instance learning
CN116363372A (en) * 2023-06-01 2023-06-30 之江实验室 Weak supervision semantic segmentation method, device, equipment and storage medium
CN116363372B (en) * 2023-06-01 2023-08-15 之江实验室 Weak supervision semantic segmentation method, device, equipment and storage medium
CN117333485A (en) * 2023-11-30 2024-01-02 华南理工大学 WSI survival prediction method based on weak supervision depth ordinal regression network
CN117333485B (en) * 2023-11-30 2024-04-05 华南理工大学 WSI survival prediction method based on weak supervision depth ordinal regression network
CN117372701B (en) * 2023-12-07 2024-03-12 厦门瑞为信息技术有限公司 Interactive image segmentation method based on Transformer
CN117372701A (en) * 2023-12-07 2024-01-09 厦门瑞为信息技术有限公司 Interactive image segmentation method based on Transformer
CN118096799A (en) * 2024-04-29 2024-05-28 浙江大学 Hybrid weakly-supervised wafer SEM defect segmentation method and system
CN118135425A (en) * 2024-05-07 2024-06-04 江西啄木蜂科技有限公司 Method for detecting change of concerned area in natural protected area

Similar Documents

Publication Publication Date Title
CN115359254A (en) Vision transform network-based weak supervision instance segmentation method, system and medium
Qian et al. 3D object detection for autonomous driving: A survey
CN110781262B (en) Semantic map construction method based on visual SLAM
CN112395957B (en) Online learning method for video target detection
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN112686274B (en) Target object detection method and device
CN112488083A (en) Traffic signal lamp identification method, device and medium for extracting key points based on heatmap
CN114332893A (en) Table structure identification method and device, computer equipment and storage medium
CN112446431A (en) Feature point extraction and matching method, network, device and computer storage medium
Zhang et al. Appearance-based loop closure detection via locality-driven accurate motion field learning
CN109190662A (en) A kind of three-dimensional vehicle detection method, system, terminal and storage medium returned based on key point
Yu et al. LiDAR-based localization using universal encoding and memory-aware regression
Lv et al. Memory‐augmented neural networks based dynamic complex image segmentation in digital twins for self‐driving vehicle
CN112418207B (en) Weak supervision character detection method based on self-attention distillation
US20210224646A1 (en) Method for generating labeled data, in particular for training a neural network, by improving initial labels
US20230298335A1 (en) Computer-implemented method, data processing apparatus and computer program for object detection
Shi et al. Dense semantic 3D map based long-term visual localization with hybrid features
CN116189130A (en) Lane line segmentation method and device based on image annotation model
CN114913519B (en) 3D target detection method and device, electronic equipment and storage medium
CN116071719A (en) Lane line semantic segmentation method and device based on model dynamic correction
Plachetka et al. DNN-based map deviation detection in LiDAR point clouds
CN114596588A (en) Damaged pedestrian image re-identification method and device based on text auxiliary feature alignment model
CN111338336B (en) Automatic driving method and device
Guo et al. Udtiri: An open-source road pothole detection benchmark suite
Saranya et al. Semantic annotation of land cover remote sensing images using fuzzy CNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination