CN117830638A

CN117830638A - Omnidirectional supervision semantic segmentation method based on prompt text

Info

Publication number: CN117830638A
Application number: CN202410239251.9A
Authority: CN
Inventors: 孙晓帅; 黄鸣浪; 纪荣嵘; 周毅奕
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2024-03-04
Filing date: 2024-03-04
Publication date: 2024-04-05

Abstract

The invention provides an omnibearing supervision semantic segmentation method based on a prompt text, which can effectively utilize various low-cost image labels to reduce the manual labeling cost of a training data set, achieve the aim of reducing the training cost of the semantic segmentation method, improve the performance and generalization of a semantic segmentation model, and guide the model to screen semantic segmentation targets in an image by combining a visual language multi-modal model and input the prompt text, and locate the positions of the targets in the image by the prompt text. The semantic segmentation method is improved based on a teacher-student model framework and utilizes an image omnibearing label supervision training model of manual annotation, and comprises the following steps: step 1, calculating teacher-student model frame during omnibearing supervisionLoss function of rackThe method comprises the steps of carrying out a first treatment on the surface of the Step 2, updating the weight of the teacher model through an index moving average algorithm。

Description

Omnidirectional supervision semantic segmentation method based on prompt text

Technical Field

The invention belongs to the technical field of semantic segmentation, relates to a method for positioning semantic segmentation areas in an image through a semantic segmentation model, and particularly relates to a method for completing a general semantic segmentation task through a prompt word guidance model in a data set with more weak label forms based on a prompt text and an omnibearing supervision-oriented semantic segmentation model.

Background

With significant advances in semantic segmentation technology, complex features and patterns have been able to be learned from a large number of annotated images. One major challenge with semantic segmentation technology is that its data annotation requires the creation of a large dataset, which requires a significant amount of time and effort to produce, as each instance must be matched to its corresponding textual description. This way of annotating, in particular creating a segmentation mask for each instance, is very time-consuming and labor-intensive, which also greatly restricts further developments in this field. Meanwhile, the traditional semantic segmentation task regards semantic segmentation as a classification task, namely targets in images can be screened only in limited categories, and targets required by the user cannot be screened out from a plurality of similar objects in the images through text information such as azimuth words, number words and the like.

In the field of computer vision, there are a large number of data sets of high quality but with only non-semantically separated labels (such as dots, sketches, and boxes, etc.). For example, the very popular MS COCO dataset provides more than hundreds of thousands of instances with a target localization box corresponding to the textual description, while more recently similar datasets have been used to improve the performance of target detection, a task similar to semantic segmentation. Although there are many semi-supervised or weakly supervised methods in the field of semantic segmentation, none of these methods fully utilize the low cost and high availability weak tag datasets described above, and therefore the quality of pseudo tag production is also unstable during training iterations. Meanwhile, with the advent of CLIP models in the multi-modal field, we also need to fuse text information with visual information, so as to realize the capability of text information to guide semantic segmentation to designate image targets. Therefore, the problem of how to combine the low-cost tag dataset with the visual speech multi-mode model needs to be solved, so as to provide a technical scheme for locating the position of the target in the image through the prompt text.

Disclosure of Invention

The invention aims to provide an omnibearing supervision semantic segmentation method based on a prompt text, which can effectively utilize various low-cost image labels (such as dots, grass coats, frames and the like) to reduce the manual labeling cost of a training data set, achieve the purpose of reducing the training cost of the semantic segmentation method, improve the performance and generalization of a semantic segmentation model, and simultaneously guide a model to screen semantic segmentation targets in an image by combining a visual language multi-modal model and input the prompt text, and locate the positions of the targets in the image by the prompt text.

In order to achieve the above object, the solution of the present invention is:

the full-scope supervision semantic segmentation method based on the prompt text and based on a teacher-student model framework of a semi-supervision computer vision direction comprises a teacher model and a student model, and the training model is supervised by utilizing an image full-scope label of manual annotation, and the method comprises the following steps:

step 1, calculating a loss function of a teacher-student model framework during omnibearing supervision

Wherein the method comprises the steps ofLoss function representing the total supervision part between the semantic segmentation result output by the computational student model and the semantic segmentation label, +.>Loss function representing an omnibearing supervision part between semantic segmentation results output by a calculation student model and a teacher model,/>Indicating the super-parameters for adjusting the weight of the omnibearing supervision loss function;

the calculation formula of (2) is

The calculation formula of (2) is

Wherein the method comprises the steps ofWeights representing student model +.>Representing an input image with semantic segmentation tags, +.>Representing an input text consisting of character strings, +.>Semantic segmentation label representing an input image, +.>Representing the semantic segmentation result of the student model output, +.>The pseudo tag which is formed by filtering and filtering the semantic segmentation result output by the teacher model is represented,the calculation formula of (2) is

Wherein the method comprises the steps ofOmnidirectional label representing manually marked image +.>Representing semantic segmentation results output by the teacher model, +.>A method for representing active pseudo tag screening;

step 2, updating the weight of the teacher model through an index moving average algorithm

Wherein the method comprises the steps ofRepresent the firstkWeights of teacher model at next iteration, +.>Represent the firstkThe weight of the student model at the time of the iteration,representing the update coefficients.

In the step (1) of the above-mentioned process,the calculation formula of (2) is composed of the following formula

Wherein the method comprises the steps ofRepresenting a decoder model; />Representing the feature matrix after multi-mode fusion and taking the feature matrix as the input of a decoder model; />Representing a multi-modal fusion model; />And->Respectively representing the results of the image feature matrix and the text feature matrix through linear projection, and taking the results as the input of the multi-mode fusion model; />Representing a linear projection layer with the aim of making +.>And->The number of channels is kept consistent; />Representing an image feature matrix output by the visual encoder model; />Representing a text feature matrix output by the text encoder model; />Representing a visual encoder model; />Representing a text encoder model;

except for about>The rest is set with +.>Is consistent with the calculation formula of (2).

In the step 1, a ResNet model is used as a visual encoder, a text encoding part of a CLIP model is used as a text encoder, a ViT model is subjected to multi-mode fusion, and a decoding part of a DeepLabv3+ model is used as a decoder to calculate a loss function in semantic segmentation in a teacher model and a student modelProviding entered prompt text +.>Ablation experiments and omnibearing supervision training are carried out.

In the step 1, the super parameter is calculatedSet to 1.

In the step 1, a text is inputText requiring positive and negative information for each word to be fed back to teacher model and student modelIn the present encoder, it is defined as follows

Wherein the method comprises the steps ofRepresenting the input text +.>Positive and negative of->Representing the input vocabulary.

In the step 1, the manually marked image omnibearing label comprises points, grass coating and frames.

Preferably, in the step 1, the process of screening the pseudo tag of the point by the active pseudo tag screening method is defined as follows

Wherein the method comprises the steps ofRepresenting pseudo tags selected by dot tag, < +.>Coordinate information representing a point tag, +.>Representing semantic segmentation regions before screening, +.>And representing the intersection part of the selected point tag and the semantic segmentation area as a result of pseudo tag screening.

Preferably, in the step 1, the process of screening the coated pseudo tag by the active pseudo tag screening method is defined as follows

Wherein the method comprises the steps ofIndicating pseudo tags selected by grass-coated tags, < - > or->Representing the pixels occupied by the grass-coated label, < >>Representing semantic segmentation regions before screening, +.>The union part of the sketched label and the semantic segmentation area is selected as a result of the pseudo label screening.

Preferably, in the step 1, the process of screening the pseudo tag of the frame by the active pseudo tag screening method is defined as follows

Wherein the method comprises the steps ofRepresenting pseudo tags selected by box tags, < +.>Representing frame tag information,/-, for>When the value of (1) is 0, it is out of the box, and when it is 1, it is in the box,/->Representing semantic segmentation regions before screening, +.>And->Respectively representing the length and width of the frame, +.>Representing a preset threshold value,/->The selection of a semantically partitioned area that exists only within the target frame is indicated as a result of pseudo tag screening when the ratio of pixels occupied by the area to pixels occupied by the frame is greater than a threshold.

In the training process of the teacher-student model framework, an Adam optimizer is used, the initial learning rate is set to be 0.0001, and the coefficients are updatedSet to 0.9996, and set the positive and negative thresholds of the teacher model output probability to 0.7 and 0.3, respectively.

After the technical scheme is adopted, the invention has the following technical effects:

on the basis of a semantic segmentation model, on one hand, by combining a multi-mode model, the invention supports the execution of semantic segmentation tasks through the input of a prompt text, such as the segmentation of specific targets in images by inputting target names, numbers, azimuth words and the like, solves the problem that the traditional semantic segmentation method can only segment targets in limited categories, and can realize the positioning of the targets prompted by the input text on the basis of the traditional semantic segmentation method; on the other hand, by combining an omnibearing supervised learning method, the method is improved on the existing semi-supervised learning framework, so that various image labels including dot, grass coating, frame and semantic segmentation labels can be utilized to train the semantic segmentation model, the manual labeling cost of the semantic segmentation label is reduced, the higher performance than that of the semi-supervised learning mode is achieved, and the generalization of the semantic segmentation model is improved; meanwhile, an active pseudo tag screening method is provided, the problem that the quality of generated pseudo tags is low in the traditional pseudo tag acquisition method is solved, and the probability of the occurrence of the over-fitting condition of the model in the iteration process is reduced.

Drawings

FIG. 1 is a schematic diagram of a frame according to an embodiment of the present invention.

Detailed Description

It should be noted that, the concept of Prompt text (Prompt) is applied to the field of natural language processing and is widely used in the field of multi-modal. In visual language tasks, it can help multimodal models understand image content through entered prompt text words, such as identifying objects in an image. The concept of omnibearing supervision (Omni-supervision) was first proposed in the UFO2 model, which is a target detection model based on the fast RCNN framework, which regards omnibearing supervision learning as a more general semi-Supervised learning mode and is an enhanced version of semi-Supervised learning. The method is characterized in that on the basis of semi-supervised utilization of unlabeled images, various available image Labels (such as dot, grass coating, frame and semantic segmentation Labels, which are collectively called Omni-Labels) can be mixed to train a visual model, the aim of reducing the manual labeling cost by utilizing Labels with lower cost is fulfilled, and better performance compared with semi-supervised learning is achieved.

In order to further explain the technical scheme of the invention, the invention is explained in detail by specific examples.

Referring to FIG. 1, the invention discloses an omnibearing supervision semantic segmentation method based on a prompt text, which comprises a model implementation process and a model training process.

1. Model implementation process

1.1 Inputting an input image with a semantic segmentation label and an input text composed of character strings into a student model to obtain a semantic segmentation result output by the student model, and calculating the semantic segmentation result and a full supervision loss function of the semantic segmentation labelThe detailed process is as described in 1.1.1-1.1.4:

1.1.1 The input image is subjected to strong enhancement processing (such as gaussian blur, color difference dithering and the like), and is input to the student model together with the input text, and the input image and the input text are encoded by using a visual encoder model and a text encoder model of the student model, respectively, and a feature matrix is output (as shown in the lower left part of fig. 1):

wherein the method comprises the steps ofRepresenting dimensions H W C _x An image feature matrix output by the visual encoder model; />Representing a visual encoder model; />Representing an input image of dimension H x W x 3 after the strong enhancement processing (H, W, C thereof _x Respectively representing the width, height and channel number of the image, 3 representing that the image is composed of three primary colors of red, green and blue), respectively,/-for the image>Representing dimension T C _t Is a text feature matrix (T, C) _t Representing the length and number of channels of the text feature matrix), both of which are input to the multimodal fusion model; />Representing text encoder model,/->Representing the input text;

1.1.2 Multi-modal fusion of the image feature matrix and the text feature matrix (as shown in the lower left of fig. 1):

and->Respectively representing the results of the image feature matrix and the text feature matrix through linear projection; />Representing a linear projection layer, the purpose of which is to keep the channel numbers of an image feature matrix and a text feature matrix input into a multi-modal model consistent, namely C can be realized _x =C _t ；/>Representing a multi-modal fused feature matrix (H, W, C representing the number of wide, high and channels) of dimension H W C and serving as input to the decoder model,/A->Representing a multi-modal fusion model;

1.1.3 Inputting the feature matrix subjected to multi-modal fusion into a decoder model to obtain a target segmentation result of an input image, namely obtaining a semantic segmentation result of a student model (shown in the lower left part of fig. 1):

wherein the method comprises the steps ofSemantic segmentation result representing model output of dimension H W,>the weights of the student model are represented,representing a decoder model;

1.1.4 Calculating a loss function of a full supervision part between a semantic segmentation result output by the student model and a semantic segmentation label(as shown in the lower left of fig. 1):

wherein the method comprises the steps ofThe semantic segmentation labels with dimensions H W are represented.

1.2 Inputting an input image and an input text without semantic segmentation labels into a teacher model and a student model simultaneously to obtain semantic segmentation results output by the two models, and calculating an omnibearing supervision loss functionThe detailed process is as follows:

1.2.1 The input image without semantic segmentation labels is subjected to strong enhancement processing and weak enhancement processing simultaneously, and is input into a teacher model together with an input text, and finally the semantic segmentation result of the teacher model is obtained(as shown in the upper left part of FIG. 1),>except for about>The rest is set with +.>Is consistent with the calculation formula of (2).

1.2.2 Semantic segmentation results obtained for 1.2.1After pseudo tag screening and filtering (detailed in 1.3), the loss function of the omnibearing supervision part is calculated +.>：

Wherein the method comprises the steps ofAnd the pseudo tag with the dimension of H multiplied by W is formed by filtering and filtering the semantic segmentation result target output by the teacher model.

1.3 The semantic segmentation result output by the teacher model is screened and filtered (as shown in the upper right part of fig. 1) by utilizing the omnibearing labels (such as dots, grass coatings and frames) of the manually marked images, and finally pseudo labels are generated and provided for the student model to train:

wherein the method comprises the steps ofOmnidirectional label representing manually marked image +.>Representing semantic segmentation results output by the teacher model, +.>The method for Active pseudo tag screening (as shown in the right part of fig. 1) actively screens pseudo tags needing to participate in training a model by using information such as tag positions, threshold values and the like in an Active Learning (Active Learning) mode, continuously eliminates low-quality pseudo tags in an iteration process, thereby improving the quality of the screened pseudo tags, and reducing the probability of the model having an over-fitting condition in the iteration processThe rate.

1.3.1 At the position ofThe formula of the pseudo tag of the screening point by the active pseudo tag screening method is defined as follows:

1.3.2 At the position ofThe formula of the method is defined as follows:

1.3.3 At the position ofThe formula of the pseudo tag of the screening frame by the active pseudo tag screening method is defined as follows:

wherein the method comprises the steps ofRepresenting pseudo tags selected by box tags, < +.>Representing frame tag information,/-, for>When the value of (1) is 0, it is out of the box, and when it is 1, it is in the box,/->Representing semantic segmentation regions before screening, +.>And->Respectively representing the length and width of the frame, +.>Representing a preset threshold value, default to 0.2 +.>Representing selection of a semantically partitioned region that exists only within the target frame and that occupies pixels at a ratio to pixels occupied by the frame that is greater thanThreshold as a result of pseudo tag screening.

1.4 For a particular semantic segmentation task, its loss function is calculated using the results from 1.1 and 1.2 (as shown in the lower right part of fig. 1):

wherein the method comprises the steps ofRepresenting the hyper-parameters that adjust the weight of the omnibearing monitor loss function.

1.5 Updating the weights of the teacher model by an exponential moving average algorithm (EMA) (as shown to the left in fig. 1):

2. Model training process:

2.1 training model:

using a ResNet model as a visual encoder, a language coding part of a CLIP model as a language encoder, and a decoding part of a deep Labv3+ model as a decoder to calculate a loss function in a teacher-student model framework facing an omnibearing supervision semantic segmentation method based on a prompt text; providing entered hint text using three data sets of Pascal VOC 2012, cityscapes and MS COCOAblation experiments and omnibearing supervision training are carried out. The Pascal VOC 2012 includes 10582 training samples, the Cityscapes includes 2975 high-resolution training samples and 500 verification samples, the MS COCO includes about 11.8 ten thousand training samples and about 5 thousand verification samples, and the prompt text data is randomly generated by counting the information of the number, the azimuth, and the like of the image targets in the data set.

2.2 Model training parameter setting:

in the training process, we use Adam optimizer and set initial learning rate to 0.0001 and update coefficientSetting 0.9996 as well as positive threshold and negative threshold of probability of outputting semantic segmentation result by teacher model as 0.7 and 0.3 respectively, and comprehensively supervising super parameter of loss function weight ++>Set to 1, the batch size of the dataset set to 64, and the training iteration round set to 40. In the model hyper-parameters, both the image width and height H and W are set to 480 by default, the channel number C to 768 by default, and the text length T to 40 by default.

The above examples and drawings are not intended to limit the form or form of the present invention, and any suitable variations or modifications thereof by those skilled in the art should be construed as not departing from the scope of the present invention.

Claims

1. The full-scope supervision semantic segmentation method based on the prompt text and based on a teacher-student model framework of a semi-supervision computer vision direction comprises a teacher model and a student model, and the training model is supervised by utilizing an image full-scope label of manual annotation, and is characterized by comprising the following steps:

the calculation formula of (2) is

The calculation formula of (2) is

Wherein the method comprises the steps ofWeights representing student model +.>Representing an input image with semantic segmentation tags, +.>Representing the word byInput text consisting of strings, < >>Semantic segmentation label representing an input image, +.>Representing the semantic segmentation result of the student model output, +.>Pseudo tag which is formed by filtering and screening semantic segmentation results output by a teacher model, and ++>The calculation formula of (2) is

Wherein the method comprises the steps ofRepresent the firstkThe weight of the teacher model at the time of the iteration,/>represent the firstkWeights of student model at the time of iteration, +.>Representing the update coefficients.

2. The hint text-based omnidirectional supervised semantic segmentation method as set forth in claim 1, wherein:

Wherein the method comprises the steps ofRepresenting a decoder model; />Representing the feature matrix after multi-mode fusion and taking the feature matrix as the input of a decoder model; />Representing a multi-modal fusion model; />And->Respectively representing the results of the image feature matrix and the text feature matrix through linear projection, and taking the results as the input of the multi-mode fusion model; />Representing a linear projection layer with the aim of making +.>Andthe number of channels is kept consistent; />Representing an image feature matrix output by the visual encoder model; />Representing a text feature matrix output by the text encoder model; />Representing a visual encoder model; />Representing a text encoder model;

3. The hint text-based omnidirectional supervised semantic segmentation method as set forth in claim 1, wherein:

4. The hint text-based omnidirectional supervised semantic segmentation method as set forth in claim 1, wherein:

in the step 1, the super parameter is calculatedSet to 1.

5. The hint text-based omnidirectional supervised semantic segmentation method as set forth in claim 1, wherein:

in the step 1, a text is inputText encoder requiring positive and negative information for each word to be fed back to teacher model and student model is defined as follows

6. The hint text-based omnidirectional supervised semantic segmentation method as set forth in claim 1, wherein:

7. The hint text-based omnidirectional supervised semantic segmentation method of claim 6, wherein:

in the step 1, the process of screening the pseudo tag of the point by the active pseudo tag screening method is defined as follows

8. The hint text-based omnidirectional supervised semantic segmentation method of claim 6, wherein:

in the step 1, the process of screening the grass-coated pseudo tag by the active pseudo tag screening method is defined as follows

9. The hint text-based omnidirectional supervised semantic segmentation method of claim 6, wherein:

in the step 1, the process of screening the pseudo tags of the frame by the active pseudo tag screening method is defined as follows

10. The hint text-based omnidirectional supervised semantic segmentation method as set forth in claim 1, wherein: