CN117830638A - Omni-supervised semantic segmentation method based on prompt text - Google Patents

Omni-supervised semantic segmentation method based on prompt text

Info

Publication number
CN117830638A
CN117830638A
Authority
CN
China
Prior art keywords
representing
model
semantic segmentation
text
teacher
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410239251.9A
Other languages
Chinese (zh)
Inventor
孙晓帅
黄鸣浪
纪荣嵘
周毅奕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202410239251.9A
Publication of CN117830638A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides an omni-supervised semantic segmentation method based on prompt text. The method effectively exploits a variety of low-cost image labels to reduce the manual annotation cost of the training dataset, lowering the training cost of semantic segmentation while improving the performance and generalization of the segmentation model. By combining a vision-language multimodal model with input prompt text, the model is guided to select the semantic segmentation targets in an image and to locate them from the prompt text. The semantic segmentation method is improved on a teacher-student model framework and supervises training with manually annotated omni-labels of the images, comprising the following steps: step 1, compute the loss function of the teacher-student framework under omni-supervision; step 2, update the weights of the teacher model through an exponential moving average algorithm.

Description

Omni-supervised semantic segmentation method based on prompt text
Technical Field
The invention belongs to the technical field of semantic segmentation and relates to methods that locate semantic segmentation regions in an image with a segmentation model. In particular, it relates to a prompt-text-based, omni-supervision-oriented semantic segmentation model in which prompt words guide the model to complete general semantic segmentation tasks on datasets dominated by weak label forms.
Background
With the significant advances in semantic segmentation, complex features and patterns can now be learned from large numbers of annotated images. A major challenge of the technology is data annotation: building a large dataset takes considerable time and effort, since every instance must be matched to its corresponding textual description. Creating a segmentation mask for each instance is especially time-consuming and labor-intensive, which greatly restricts further development of the field. Meanwhile, the traditional semantic segmentation task is treated as a classification task: targets in an image can only be selected from a limited set of categories, and a desired target cannot be singled out from several similar objects in the image through textual information such as direction words and number words.
In the field of computer vision there are many datasets of high quality that carry only non-segmentation labels (such as points, scribbles, and boxes). For example, the popular MS COCO dataset provides hundreds of thousands of instances with target localization boxes matched to textual descriptions, and similar datasets have recently been used to improve the performance of object detection, a task closely related to semantic segmentation. Although many semi-supervised and weakly supervised methods exist in the semantic segmentation field, none of them fully exploits the low-cost, highly available weak-label datasets described above, so the quality of the pseudo labels they produce is unstable across training iterations. Meanwhile, with the advent of the CLIP model in the multimodal field, text information also needs to be fused with visual information, so that text can guide semantic segmentation toward designated image targets. The problem to be solved is therefore how to combine low-cost label datasets with a vision-language multimodal model, providing a technical scheme for locating targets in an image through prompt text.
Disclosure of Invention
The invention aims to provide an omni-supervised semantic segmentation method based on prompt text, which effectively exploits a variety of low-cost image labels (such as points, scribbles, and boxes) to reduce the manual annotation cost of the training dataset, lowering the training cost of semantic segmentation while improving the performance and generalization of the segmentation model; at the same time, by combining a vision-language multimodal model with input prompt text, the model is guided to select the semantic segmentation targets in an image and to locate them from the prompt text.
In order to achieve the above object, the solution of the present invention is:
the full-scope supervision semantic segmentation method based on the prompt text and based on a teacher-student model framework of a semi-supervision computer vision direction comprises a teacher model and a student model, and the training model is supervised by utilizing an image full-scope label of manual annotation, and the method comprises the following steps:
Step 1: compute the loss function of the teacher-student framework under omni-supervision

$$L = L_{\text{sup}} + \lambda L_{\text{omni}}$$

where $L_{\text{sup}}$ denotes the fully supervised loss between the semantic segmentation result output by the student model and the semantic segmentation label, $L_{\text{omni}}$ denotes the omni-supervised loss between the semantic segmentation results output by the student model and the teacher model, and $\lambda$ denotes the hyper-parameter that weights the omni-supervised loss;
the calculation formula of (2) is
The calculation formula of (2) is
Wherein the method comprises the steps ofWeights representing student model +.>Representing an input image with semantic segmentation tags, +.>Representing an input text consisting of character strings, +.>Semantic segmentation label representing an input image, +.>Representing the semantic segmentation result of the student model output, +.>The pseudo tag which is formed by filtering and filtering the semantic segmentation result output by the teacher model is represented,the calculation formula of (2) is
Wherein the method comprises the steps ofOmnidirectional label representing manually marked image +.>Representing semantic segmentation results output by the teacher model, +.>A method for representing active pseudo tag screening;
Step 2: update the weights of the teacher model through an exponential moving average algorithm

$$\theta_t^{(k)} = \alpha\,\theta_t^{(k-1)} + (1-\alpha)\,\theta_s^{(k)}$$

where $\theta_t^{(k)}$ denotes the weights of the teacher model at the $k$-th iteration, $\theta_s^{(k)}$ the weights of the student model at the $k$-th iteration, and $\alpha$ the update coefficient.
In step 1, $p_s$ is computed by the following chain of formulas

$$p_s = D(f), \quad f = M(v', t'), \quad v' = \phi(v), \quad t' = \phi(f_t), \quad v = E_v(x), \quad f_t = E_t(t)$$

where $D$ denotes the decoder model; $f$ the multimodally fused feature matrix, which serves as input to the decoder model; $M$ the multimodal fusion model; $v'$ and $t'$ the results of passing the image and text feature matrices through linear projection, which serve as input to the multimodal fusion model; $\phi$ a linear projection layer whose purpose is to keep the channel counts of $v'$ and $t'$ consistent; $v$ the image feature matrix output by the visual encoder model; $f_t$ the text feature matrix output by the text encoder model; $E_v$ the visual encoder model; and $E_t$ the text encoder model;

the formula for $p_t$ is identical to that for $p_s$, except that the teacher weights $\theta_t$ replace the student weights $\theta_s$.
In step 1, a ResNet model serves as the visual encoder, the text-encoding part of a CLIP model serves as the text encoder, a ViT model performs the multimodal fusion, and the decoding part of a DeepLabv3+ model serves as the decoder when computing the semantic segmentation loss functions in the teacher and student models; datasets providing the input prompt text $t$ are used for ablation experiments and omni-supervised training.
In step 1, the hyper-parameter $\lambda$ is set to 1.
In step 1, the input text $t$ carries positive/negative information for each word fed to the text encoders of the teacher model and the student model; it is defined as

$$t = \{(w_i, l_i)\}_{i=1}^{T}$$

where $l_i$ denotes whether the input word is positive or negative and $w_i$ denotes the input vocabulary word.
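As a purely illustrative sketch (the pairing layout and variable names below are assumptions, not the patent's specification), the word/polarity pairs might be assembled as follows before being handed to the text encoders:

```python
# Hypothetical packing of the input text t = {(w_i, l_i)}: each word w_i
# carries a polarity flag l_i (1 = positive, 0 = negative) consumed by the
# text encoders of both the teacher and the student.
prompt = [("dog", 1), ("two", 1), ("left", 1), ("cat", 0)]
words = [w for w, _ in prompt]      # input vocabulary words w_i
polarity = [l for _, l in prompt]   # positive/negative flags l_i
```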
In step 1, the manually annotated omni-labels of the image include points, scribbles, and boxes.
Preferably, in step 1, the process by which the active pseudo-label screening method screens point pseudo labels is defined as

$$\hat{y}_{\text{point}} = P \cap R$$

where $\hat{y}_{\text{point}}$ denotes the pseudo label selected by the point label, $P$ the coordinate information of the point label, and $R$ the semantic segmentation regions before screening; the intersection of the selected point label with the semantic segmentation regions is taken as the result of pseudo-label screening.
Preferably, in step 1, the process by which the active pseudo-label screening method screens scribble pseudo labels is defined as

$$\hat{y}_{\text{scr}} = S \cup R$$

where $\hat{y}_{\text{scr}}$ denotes the pseudo label selected by the scribble label, $S$ the pixels occupied by the scribble label, and $R$ the semantic segmentation regions before screening; the union of the selected scribble label with the semantic segmentation regions is taken as the result of pseudo-label screening.
Preferably, in step 1, the process by which the active pseudo-label screening method screens box pseudo labels is defined as

$$\hat{y}_{\text{box}} = \left\{ r \cap B \;\middle|\; r \in R,\; \frac{|r \cap B|}{h \times w} > \tau \right\}$$

where $\hat{y}_{\text{box}}$ denotes the pseudo label selected by the box label; $B$ the box label information, taking the value 0 outside the box and 1 inside it; $R$ the semantic segmentation regions before screening; $h$ and $w$ the length and width of the box; and $\tau$ a preset threshold. A semantic segmentation region existing only within the target box is selected as the result of pseudo-label screening when the ratio of the pixels it occupies to the pixels occupied by the box exceeds the threshold.
During training of the teacher-student model framework, an Adam optimizer is used, the initial learning rate is set to 0.0001, the update coefficient $\alpha$ is set to 0.9996, and the positive and negative thresholds on the teacher model's output probability are set to 0.7 and 0.3, respectively.
After the technical scheme is adopted, the invention has the following technical effects:
Building on a semantic segmentation model, the invention, on one hand, combines a multimodal model to support semantic segmentation driven by input prompt text, for example segmenting specific targets in an image by entering target names, numbers, and direction words; this solves the problem that traditional semantic segmentation methods can only segment targets from a limited set of categories, and adds to them the ability to locate the targets named by the input text. On the other hand, by adopting an omni-supervised learning method that improves on the existing semi-supervised learning framework, the semantic segmentation model can be trained with a variety of image labels, including point, scribble, box, and semantic segmentation labels; this reduces the manual annotation cost of segmentation labels, reaches higher performance than the semi-supervised learning mode, and improves the generalization of the model. Meanwhile, an active pseudo-label screening method is proposed, which overcomes the low quality of the pseudo labels generated by traditional acquisition methods and reduces the probability that the model overfits during iteration.
Drawings
FIG. 1 is a schematic diagram of the framework according to an embodiment of the present invention.
Detailed Description
It should be noted that the concept of prompt text (Prompt) originated in natural language processing and is now widely used in the multimodal field. In vision-language tasks, entered prompt words help a multimodal model understand image content, for example by identifying objects in an image. The concept of omni-supervision (Omni-supervision) was first proposed in the UFO2 model, an object detection model based on the Faster R-CNN framework, which regards omni-supervised learning as a more general mode of semi-supervised learning, an enhanced version of it. On top of the semi-supervised use of unlabeled images, it mixes all available image labels (point, scribble, box, and semantic segmentation labels, collectively called Omni-Labels) to train a vision model, reducing manual annotation cost through cheaper labels and achieving better performance than semi-supervised learning.
In order to further explain the technical scheme of the invention, the invention is explained in detail by specific examples.
Referring to FIG. 1, the invention discloses an omni-supervised semantic segmentation method based on prompt text, comprising a model implementation process and a model training process.
1. Model implementation process
1.1 An input image with a semantic segmentation label, together with an input text consisting of character strings, is fed into the student model to obtain the student model's semantic segmentation result, and the fully supervised loss $L_{\text{sup}}$ between this result and the semantic segmentation label is computed. The detailed process is described in 1.1.1-1.1.4:
1.1.1 The input image undergoes strong augmentation (such as Gaussian blur and color jitter) and is fed, together with the input text, into the student model; the visual encoder model and the text encoder model of the student model encode the image and the text respectively and output feature matrices (lower left of FIG. 1):

$$v = E_v(x), \qquad f_t = E_t(t)$$

where $v \in \mathbb{R}^{H \times W \times C_x}$ denotes the image feature matrix output by the visual encoder model $E_v$; $x \in \mathbb{R}^{H \times W \times 3}$ the strongly augmented input image ($H$, $W$, and $C_x$ denote width, height, and channel count, and 3 reflects the red, green, and blue primaries); $f_t \in \mathbb{R}^{T \times C_t}$ the text feature matrix ($T$ and $C_t$ denote its length and channel count); $E_t$ the text encoder model; and $t$ the input text. Both feature matrices serve as input to the multimodal fusion model;
1.1.2 The image feature matrix and the text feature matrix undergo multimodal fusion (lower left of FIG. 1):

$$f = M\big(\phi(v), \phi(f_t)\big)$$

where $\phi(v)$ and $\phi(f_t)$ denote the results of passing the image and text feature matrices through linear projection; $\phi$ is a linear projection layer whose purpose is to keep the channel counts of the two matrices entering the multimodal model consistent, i.e., to realize $C_x = C_t$; $f \in \mathbb{R}^{H \times W \times C}$ denotes the multimodally fused feature matrix ($H$, $W$, and $C$ denote width, height, and channel count), which serves as input to the decoder model; and $M$ denotes the multimodal fusion model;
1.1.3 The multimodally fused feature matrix is fed into the decoder model to obtain the target segmentation of the input image, i.e., the student model's semantic segmentation result (lower left of FIG. 1):

$$p_s = D_{\theta_s}(f)$$

where $p_s \in \mathbb{R}^{H \times W}$ denotes the semantic segmentation result output by the model, $\theta_s$ the weights of the student model, and $D$ the decoder model;
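The three sub-steps above can be summarized in a PyTorch-style sketch under stated assumptions: the module classes, channel sizes, and tensor layouts below are placeholders, while the embodiment itself prescribes a ResNet visual encoder, the CLIP text encoder, ViT-based fusion, and the DeepLabv3+ decoding head.

```python
# Minimal sketch of the student forward pass (1.1.1-1.1.3): encode image and
# text, project both to a shared channel count C, fuse, then decode to a mask.
import torch.nn as nn

class PromptSegStudent(nn.Module):
    def __init__(self, visual_encoder, text_encoder, fusion, decoder,
                 c_x=2048, c_t=512, c=768):
        super().__init__()
        self.visual_encoder = visual_encoder  # E_v, e.g. a ResNet backbone
        self.text_encoder = text_encoder      # E_t, e.g. the CLIP text encoder
        self.proj_v = nn.Linear(c_x, c)       # phi: align image channels to C
        self.proj_t = nn.Linear(c_t, c)       # phi: align text channels to C
        self.fusion = fusion                  # M, ViT-style multimodal fusion
        self.decoder = decoder                # D, e.g. DeepLabv3+ head

    def forward(self, image, text_tokens):
        v = self.visual_encoder(image)        # (B, H, W, C_x) image features
        f_t = self.text_encoder(text_tokens)  # (B, T, C_t) text features
        f = self.fusion(self.proj_v(v), self.proj_t(f_t))  # (B, H, W, C)
        return self.decoder(f)                # (B, H, W) segmentation logits
```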
1.1.4 The fully supervised loss between the student model's semantic segmentation result and the semantic segmentation label is computed (lower left of FIG. 1):

$$L_{\text{sup}} = \ell\big(p_s, y\big)$$

where $y \in \mathbb{R}^{H \times W}$ denotes the semantic segmentation label and $\ell$ the pixel-wise segmentation loss.
1.2 An input image without a semantic segmentation label and an input text are fed into the teacher model and the student model simultaneously to obtain the semantic segmentation results of both models, and the omni-supervised loss $L_{\text{omni}}$ is computed. The detailed process is as follows:
1.2.1 The input image without a semantic segmentation label is subjected to strong and weak augmentation simultaneously and fed, together with the input text, into the teacher model, finally yielding the teacher model's semantic segmentation result $p_t$ (upper left of FIG. 1); the formula for $p_t$ is identical to that for $p_s$, except that the teacher weights $\theta_t$ replace the student weights $\theta_s$.
1.2.2 The semantic segmentation result $p_t$ obtained in 1.2.1 is screened and filtered into a pseudo label (detailed in 1.3), after which the omni-supervised loss is computed:

$$L_{\text{omni}} = \ell\big(p_s, \hat{y}\big)$$

where $\hat{y} \in \mathbb{R}^{H \times W}$ denotes the pseudo label formed by screening and filtering the semantic segmentation result output by the teacher model.
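A hedged sketch of this step follows; the binary (sigmoid) formulation and the helper name are assumptions, and the 0.7/0.3 confidence thresholds are the settings given in section 2.2.

```python
# Sketch of the omni-supervised loss (1.2.2): threshold the teacher's output
# probability into a pseudo label, ignore uncertain pixels, and train the
# student on the confident ones.
import torch
import torch.nn.functional as F

def omni_loss(student_logits, teacher_probs, pos_thresh=0.7, neg_thresh=0.3):
    pseudo = torch.full_like(teacher_probs, -1.0)   # -1 marks "ignore"
    pseudo[teacher_probs >= pos_thresh] = 1.0       # confident foreground
    pseudo[teacher_probs <= neg_thresh] = 0.0       # confident background
    keep = pseudo >= 0                              # drop uncertain pixels
    if not keep.any():
        return student_logits.sum() * 0.0           # keeps the graph alive
    return F.binary_cross_entropy_with_logits(student_logits[keep],
                                              pseudo[keep])
```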
1.3 The semantic segmentation result output by the teacher model is screened and filtered using the manually annotated omni-labels of the image, such as points, scribbles, and boxes (upper right of FIG. 1), finally generating pseudo labels for training the student model:

$$\hat{y} = \sigma\big(y_{\text{omni}}, p_t\big)$$

where $y_{\text{omni}}$ denotes the manually annotated omni-label of the image, $p_t$ the semantic segmentation result output by the teacher model, and $\sigma$ the active pseudo-label screening method (right of FIG. 1). In the manner of active learning, this method uses information such as label positions and thresholds to actively screen the pseudo labels that should participate in training the model, and it continuously eliminates low-quality pseudo labels across iterations, improving the quality of the screened pseudo labels and reducing the probability that the model overfits during iteration.
1.3.1 In $\sigma$, the formula by which the active pseudo-label screening method screens point pseudo labels is defined as:

$$\hat{y}_{\text{point}} = P \cap R$$

where $\hat{y}_{\text{point}}$ denotes the pseudo label selected by the point label, $P$ the coordinate information of the point label, and $R$ the semantic segmentation regions before screening; the intersection of the selected point label with the semantic segmentation regions is taken as the result of pseudo-label screening.
1.3.2 In $\sigma$, the formula by which the active pseudo-label screening method screens scribble pseudo labels is defined as:

$$\hat{y}_{\text{scr}} = S \cup R$$

where $\hat{y}_{\text{scr}}$ denotes the pseudo label selected by the scribble label, $S$ the pixels occupied by the scribble label, and $R$ the semantic segmentation regions before screening; the union of the selected scribble label with the semantic segmentation regions is taken as the result of pseudo-label screening.
1.3.3 In $\sigma$, the formula by which the active pseudo-label screening method screens box pseudo labels is defined as:

$$\hat{y}_{\text{box}} = \left\{ r \cap B \;\middle|\; r \in R,\; \frac{|r \cap B|}{h \times w} > \tau \right\}$$

where $\hat{y}_{\text{box}}$ denotes the pseudo label selected by the box label; $B$ the box label information, taking the value 0 outside the box and 1 inside it; $R$ the semantic segmentation regions before screening; $h$ and $w$ the length and width of the box; and $\tau$ a preset threshold, 0.2 by default. A semantic segmentation region existing only within the target box is selected as the result of pseudo-label screening when the ratio of the pixels it occupies to the pixels occupied by the box exceeds the threshold.
1.4 For a specific semantic segmentation task, the loss function is computed from the results of 1.1 and 1.2 (lower right of FIG. 1):

$$L = L_{\text{sup}} + \lambda L_{\text{omni}}$$

where $\lambda$ denotes the hyper-parameter that weights the omni-supervised loss.
1.5 The weights of the teacher model are updated through an exponential moving average (EMA) algorithm (left of FIG. 1):

$$\theta_t^{(k)} = \alpha\,\theta_t^{(k-1)} + (1-\alpha)\,\theta_s^{(k)}$$

where $\theta_t^{(k)}$ denotes the weights of the teacher model at the $k$-th iteration, $\theta_s^{(k)}$ the weights of the student model at the $k$-th iteration, and $\alpha$ the update coefficient.
2. Model training process:
2.1 Training the model:

A ResNet model serves as the visual encoder, the text-encoding part of a CLIP model as the text encoder, and the decoding part of a DeepLabv3+ model as the decoder when computing the loss functions in the teacher-student framework of the prompt-text-based omni-supervised semantic segmentation method. Three datasets (Pascal VOC 2012, Cityscapes, and MS COCO) provide the input prompt text $t$ for ablation experiments and omni-supervised training. Pascal VOC 2012 contains 10,582 training samples; Cityscapes contains 2,975 high-resolution training samples and 500 validation samples; MS COCO contains about 118,000 training samples and about 5,000 validation samples. The prompt text data is generated randomly from statistics of the image targets in each dataset, such as their counts and directions.
2.2 Model training parameter settings:

During training, an Adam optimizer is used with an initial learning rate of 0.0001; the update coefficient $\alpha$ is set to 0.9996; the positive and negative thresholds on the probability of the teacher model's output segmentation result are set to 0.7 and 0.3, respectively; the hyper-parameter $\lambda$ weighting the omni-supervised loss is set to 1; the dataset batch size is set to 64; and training runs for 40 iteration rounds. Among the model hyper-parameters, the image width and height $W$ and $H$ both default to 480, the channel count $C$ defaults to 768, and the text length $T$ defaults to 40.
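The published settings can be collected into one training-step sketch. Only the optimizer settings, loss weighting, thresholds, and EMA coefficient come from the text above; the data loader fields, the strong/weak augmentation split per branch, the `sup_loss` and `screen` helpers, and the sigmoid output formulation are assumptions (`omni_loss` and `ema_update` are the sketches from sections 1.2 and 1.5).

```python
# Hypothetical training loop wiring the published hyper-parameters together:
# Adam at lr 1e-4, lambda = 1, batch size 64, 40 epochs, EMA alpha = 0.9996.
import torch

def train(student, teacher, loader, epochs=40, lr=1e-4, lam=1.0):
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for img_strong, img_weak, text, label, omni_label in loader:
            p_s = student(img_strong, text)              # student prediction
            with torch.no_grad():
                p_t = teacher(img_weak, text).sigmoid()  # teacher probability
            pseudo = screen(omni_label, p_t)             # active screening (1.3)
            loss = sup_loss(p_s, label) + lam * omni_loss(p_s, pseudo)
            opt.zero_grad(); loss.backward(); opt.step()
            ema_update(teacher, student, alpha=0.9996)   # EMA teacher (1.5)
```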
The above embodiments and drawings do not limit the product form or style of the present invention, and any suitable variation or modification of it by a person of ordinary skill in the art shall be regarded as not departing from the scope of the invention.

Claims (10)

1. A prompt-text-based omni-supervised semantic segmentation method, built on a teacher-student model framework from the semi-supervised computer vision direction that comprises a teacher model and a student model, with training supervised by manually annotated omni-labels of the images, characterized by comprising the following steps:

Step 1: compute the loss function of the teacher-student framework under omni-supervision

$$L = L_{\text{sup}} + \lambda L_{\text{omni}}$$

where $L_{\text{sup}}$ denotes the fully supervised loss between the semantic segmentation result output by the student model and the semantic segmentation label, $L_{\text{omni}}$ denotes the omni-supervised loss between the semantic segmentation results output by the student model and the teacher model, and $\lambda$ denotes the hyper-parameter that weights the omni-supervised loss;

$L_{\text{sup}}$ is computed as

$$L_{\text{sup}} = \ell\big(p_s, y\big), \qquad p_s = F_{\theta_s}(x, t)$$

$L_{\text{omni}}$ is computed as

$$L_{\text{omni}} = \ell\big(p_s, \hat{y}\big)$$

where $\theta_s$ denotes the weights of the student model, $x$ an input image with a semantic segmentation label, $t$ an input text consisting of character strings, $y$ the semantic segmentation label of the input image, $p_s$ the semantic segmentation result output by the student model, $\ell$ the pixel-wise segmentation loss, and $\hat{y}$ the pseudo label formed by screening and filtering the semantic segmentation result output by the teacher model; $\hat{y}$ is computed as

$$\hat{y} = \sigma\big(y_{\text{omni}}, p_t\big)$$

where $y_{\text{omni}}$ denotes the manually annotated omni-label of the image, $p_t$ the semantic segmentation result output by the teacher model, and $\sigma$ the active pseudo-label screening method;

Step 2: update the weights of the teacher model through an exponential moving average algorithm

$$\theta_t^{(k)} = \alpha\,\theta_t^{(k-1)} + (1-\alpha)\,\theta_s^{(k)}$$

where $\theta_t^{(k)}$ denotes the weights of the teacher model at the $k$-th iteration, $\theta_s^{(k)}$ the weights of the student model at the $k$-th iteration, and $\alpha$ the update coefficient.
2. The prompt-text-based omni-supervised semantic segmentation method as set forth in claim 1, wherein:

in step 1, $p_s$ is computed by the following chain of formulas

$$p_s = D(f), \quad f = M(v', t'), \quad v' = \phi(v), \quad t' = \phi(f_t), \quad v = E_v(x), \quad f_t = E_t(t)$$

where $D$ denotes the decoder model; $f$ the multimodally fused feature matrix, which serves as input to the decoder model; $M$ the multimodal fusion model; $v'$ and $t'$ the results of passing the image and text feature matrices through linear projection, which serve as input to the multimodal fusion model; $\phi$ a linear projection layer whose purpose is to keep the channel counts of $v'$ and $t'$ consistent; $v$ the image feature matrix output by the visual encoder model; $f_t$ the text feature matrix output by the text encoder model; $E_v$ the visual encoder model; and $E_t$ the text encoder model;

the formula for $p_t$ is identical to that for $p_s$, except that the teacher weights $\theta_t$ replace the student weights $\theta_s$.
3. The prompt-text-based omni-supervised semantic segmentation method as set forth in claim 1, wherein:

in step 1, a ResNet model serves as the visual encoder, the text-encoding part of a CLIP model serves as the text encoder, a ViT model performs the multimodal fusion, and the decoding part of a DeepLabv3+ model serves as the decoder when computing the semantic segmentation loss functions in the teacher and student models; datasets providing the input prompt text $t$ are used for ablation experiments and omni-supervised training.
4. The prompt-text-based omni-supervised semantic segmentation method as set forth in claim 1, wherein:

in step 1, the hyper-parameter $\lambda$ is set to 1.
5. The prompt-text-based omni-supervised semantic segmentation method as set forth in claim 1, wherein:

in step 1, the input text $t$ carries positive/negative information for each word fed to the text encoders of the teacher model and the student model; it is defined as

$$t = \{(w_i, l_i)\}_{i=1}^{T}$$

where $l_i$ denotes whether the input word is positive or negative and $w_i$ denotes the input vocabulary word.
6. The prompt-text-based omni-supervised semantic segmentation method as set forth in claim 1, wherein:

in step 1, the manually annotated omni-labels of the image include points, scribbles, and boxes.
7. The prompt-text-based omni-supervised semantic segmentation method as set forth in claim 6, wherein:

in step 1, the process by which the active pseudo-label screening method screens point pseudo labels is defined as

$$\hat{y}_{\text{point}} = P \cap R$$

where $\hat{y}_{\text{point}}$ denotes the pseudo label selected by the point label, $P$ the coordinate information of the point label, and $R$ the semantic segmentation regions before screening; the intersection of the selected point label with the semantic segmentation regions is taken as the result of pseudo-label screening.
8. The prompt-text-based omni-supervised semantic segmentation method as set forth in claim 6, wherein:

in step 1, the process by which the active pseudo-label screening method screens scribble pseudo labels is defined as

$$\hat{y}_{\text{scr}} = S \cup R$$

where $\hat{y}_{\text{scr}}$ denotes the pseudo label selected by the scribble label, $S$ the pixels occupied by the scribble label, and $R$ the semantic segmentation regions before screening; the union of the selected scribble label with the semantic segmentation regions is taken as the result of pseudo-label screening.
9. The prompt-text-based omni-supervised semantic segmentation method as set forth in claim 6, wherein:

in step 1, the process by which the active pseudo-label screening method screens box pseudo labels is defined as

$$\hat{y}_{\text{box}} = \left\{ r \cap B \;\middle|\; r \in R,\; \frac{|r \cap B|}{h \times w} > \tau \right\}$$

where $\hat{y}_{\text{box}}$ denotes the pseudo label selected by the box label; $B$ the box label information, taking the value 0 outside the box and 1 inside it; $R$ the semantic segmentation regions before screening; $h$ and $w$ the length and width of the box; and $\tau$ a preset threshold. A semantic segmentation region existing only within the target box is selected as the result of pseudo-label screening when the ratio of the pixels it occupies to the pixels occupied by the box exceeds the threshold.
10. The prompt-text-based omni-supervised semantic segmentation method as set forth in claim 1, wherein:

during training of the teacher-student model framework, an Adam optimizer is used, the initial learning rate is set to 0.0001, the update coefficient $\alpha$ is set to 0.9996, and the positive and negative thresholds on the teacher model's output probability are set to 0.7 and 0.3, respectively.
CN202410239251.9A 2024-03-04 2024-03-04 Omni-supervised semantic segmentation method based on prompt text Pending CN117830638A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410239251.9A CN117830638A (en) Omni-supervised semantic segmentation method based on prompt text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410239251.9A CN117830638A (en) Omni-supervised semantic segmentation method based on prompt text

Publications (1)

Publication Number Publication Date
CN117830638A 2024-04-05

Family

ID=90523146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410239251.9A Pending CN117830638A (en) Omni-supervised semantic segmentation method based on prompt text

Country Status (1)

Country Link
CN (1) CN117830638A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220156593A1 (en) * 2020-11-16 2022-05-19 Salesforce.Com, Inc. Systems and methods for video representation learning with a weak teacher
US20230093619A1 (en) * 2021-09-17 2023-03-23 Uif (University Industry Foundation), Yonsei University Weakly supervised semantic segmentation device and method based on pseudo-masks
CN114140390A (en) * 2021-11-02 2022-03-04 广州大学 Crack detection method and device based on semi-supervised semantic segmentation
CN115861164A (en) * 2022-09-16 2023-03-28 重庆邮电大学 Medical image segmentation method based on multi-field semi-supervision
CN116993975A (en) * 2023-07-11 2023-11-03 复旦大学 Panoramic camera semantic segmentation method based on deep learning unsupervised field adaptation
CN117058024A (en) * 2023-08-04 2023-11-14 淮阴工学院 Transformer-based efficient defogging semantic segmentation method and application thereof
CN117237648A (en) * 2023-11-16 2023-12-15 中国农业科学院农业资源与农业区划研究所 Training method, device and equipment of semantic segmentation model based on context awareness

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIAMU SUN ET AL.: "RefTeacher: A Strong Baseline for Semi-Supervised Referring Expression Comprehension", 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 22 August 2023, pages 19144-19151 *
TENG GUOLONG: "Research and Application of Real-Time Semantic Segmentation Algorithms Based on Semi-Supervised Learning", China Master's Theses Full-text Database, no. 02, 15 February 2023, pages 20-72 *

Similar Documents

Publication Publication Date Title
CN110097131B (en) Semi-supervised medical image segmentation method based on countermeasure cooperative training
Jiang et al. Scfont: Structure-guided chinese font generation via deep stacked networks
Dvornik et al. On the importance of visual context for data augmentation in scene understanding
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN112887698B (en) High-quality face voice driving method based on nerve radiation field
CN111738251B (en) Optical character recognition method and device fused with language model and electronic equipment
CN108874174A (en) A kind of text error correction method, device and relevant device
CN111723585A (en) Style-controllable image text real-time translation and conversion method
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN107251059A (en) Sparse reasoning module for deep learning
CN110533024B (en) Double-quadratic pooling fine-grained image classification method based on multi-scale ROI (region of interest) features
CN109086768B (en) Semantic image segmentation method of convolutional neural network
CN111737511B (en) Image description method based on self-adaptive local concept embedding
CN111160533A (en) Neural network acceleration method based on cross-resolution knowledge distillation
Zhang et al. Efficient inductive vision transformer for oriented object detection in remote sensing imagery
CN113673338B (en) Automatic labeling method, system and medium for weak supervision of natural scene text image character pixels
CN110880176B (en) Semi-supervised industrial image defect segmentation method based on countermeasure generation network
CN112070114B (en) Scene character recognition method and system based on Gaussian constraint attention mechanism network
CN113807340B (en) Attention mechanism-based irregular natural scene text recognition method
CN114565808B (en) Double-action contrast learning method for unsupervised visual representation
CN111914555A (en) Automatic relation extraction system based on Transformer structure
CN113378949A (en) Dual-generation confrontation learning method based on capsule network and mixed attention
CN115718815A (en) Cross-modal retrieval method and system
Qu et al. Exploring stroke-level modifications for scene text editing
CN111739037A (en) Semantic segmentation method for indoor scene RGB-D image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination