CN117830638A - Omnidirectional supervision semantic segmentation method based on prompt text - Google Patents
Omnidirectional supervision semantic segmentation method based on prompt text
- Publication number: CN117830638A
- Application number: CN202410239251.9A
- Authority: CN (China)
- Prior art keywords: representing, model, semantic segmentation, text, teacher
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention provides an omni-supervised semantic segmentation method based on prompt text. It can effectively use various low-cost image labels to reduce the manual annotation cost of the training dataset, lowering the training cost of semantic segmentation while improving the performance and generalization of the segmentation model. By combining a visual-language multi-modal model with an input prompt text, the model is guided to screen semantic segmentation targets in an image and to locate their positions from the prompt text. The segmentation method is built on a teacher-student model framework and trains the model under the supervision of manually annotated omni-labels of images, comprising the following steps: step 1, calculate the loss function of the teacher-student model framework under omni-supervision; step 2, update the weights of the teacher model through an exponential moving average algorithm.
Description
Technical Field
The invention belongs to the technical field of semantic segmentation and relates to a method for locating semantic segmentation regions in an image with a semantic segmentation model; in particular, it relates to a prompt-text-based, omni-supervised semantic segmentation model that, on datasets rich in weak label forms, completes general semantic segmentation tasks under the guidance of prompt words.
Background
With the significant advances in semantic segmentation technology, complex features and patterns can now be learned from large numbers of annotated images. A major challenge of semantic segmentation is that its data annotation requires building large datasets, which takes considerable time and effort, since every instance must be matched to its corresponding textual description. This style of annotation, in particular creating a segmentation mask for each instance, is extremely time-consuming and labor-intensive, and greatly restricts further development of the field. Meanwhile, the traditional semantic segmentation task treats segmentation as a classification task: targets in an image can only be screened within a limited set of categories, and a desired target cannot be singled out from several similar objects in the image through textual information such as direction words and number words.
In the field of computer vision there are many high-quality datasets that carry only non-semantic-segmentation labels (such as points, scribbles, and boxes). For example, the popular MS COCO dataset provides hundreds of thousands of instances with target localization boxes matched to textual descriptions, and similar datasets have recently been used to improve the performance of object detection, a task closely related to semantic segmentation. Although many semi-supervised and weakly supervised methods exist in the field of semantic segmentation, none of them fully exploit the low-cost, highly available weak-label datasets described above, so the quality of the pseudo labels they produce is unstable across training iterations. Meanwhile, with the advent of the CLIP model in the multi-modal field, text information needs to be fused with visual information so that text can guide semantic segmentation toward a designated image target. Therefore, the problem of combining low-cost label datasets with a visual-language multi-modal model needs to be solved, providing a technical scheme for locating a target's position in an image through prompt text.
Disclosure of Invention
The invention aims to provide an omni-supervised semantic segmentation method based on prompt text, which can effectively use various low-cost image labels (such as points, scribbles, and boxes) to reduce the manual annotation cost of the training dataset, thereby lowering the training cost of semantic segmentation while improving the performance and generalization of the segmentation model; at the same time, by combining a visual-language multi-modal model with input prompt text, the model is guided to screen semantic segmentation targets in an image and to locate their positions from the prompt text.
In order to achieve the above object, the solution of the present invention is:
the prompt-text-based omni-supervised semantic segmentation method is built on a teacher-student model framework from the semi-supervised computer-vision direction, comprising a teacher model and a student model, with the training model supervised by manually annotated omni-labels of images; the method comprises the following steps:
Step 1: calculate the loss function of the teacher-student model framework under omni-supervision:

L = L_sup + λ · L_omni

where L_sup denotes the fully supervised loss between the semantic segmentation result output by the student model and the semantic segmentation label, L_omni denotes the omni-supervised loss between the semantic segmentation results output by the student model and the teacher model, and λ denotes the hyper-parameter adjusting the weight of the omni-supervised loss;

L_sup is calculated as

L_sup = ℓ(p_s, y),  with p_s = f(x, t; θ_s)

L_omni is calculated as

L_omni = ℓ(p_s, ŷ)

where θ_s denotes the weights of the student model, x denotes an input image with a semantic segmentation label, t denotes the input text composed of character strings, y denotes the semantic segmentation label of the input image, p_s denotes the semantic segmentation result output by the student model, and ŷ denotes the pseudo label obtained by screening and filtering the semantic segmentation result output by the teacher model; ŷ is calculated as

ŷ = A(G, p_t)

where G denotes the manually annotated omni-label of the image, p_t denotes the semantic segmentation result output by the teacher model, and A denotes the active pseudo-label screening method;

Step 2: update the weights of the teacher model through an exponential moving average algorithm:

θ_t^(k) = α · θ_t^(k−1) + (1 − α) · θ_s^(k)

where θ_t^(k) denotes the weights of the teacher model at the k-th iteration, θ_s^(k) denotes the weights of the student model at the k-th iteration, and α denotes the update coefficient.
In step 1, p_s is calculated by the following chain of formulas:

p_s = D(F)
F = M(X′, T′)
X′ = Proj(X),  T′ = Proj(T)
X = E_v(x),  T = E_t(t)

where D denotes the decoder model; F denotes the feature matrix after multi-modal fusion, which serves as the input of the decoder model; M denotes the multi-modal fusion model; X′ and T′ denote the results of passing the image feature matrix and the text feature matrix through linear projection, which serve as the inputs of the multi-modal fusion model; Proj denotes a linear projection layer whose purpose is to keep the channel numbers of X′ and T′ consistent; X denotes the image feature matrix output by the visual encoder model; T denotes the text feature matrix output by the text encoder model; E_v denotes the visual encoder model; and E_t denotes the text encoder model.

The calculation of p_t is identical to that of p_s except that the teacher model's weights θ_t are used in place of θ_s.
In step 1, a ResNet model serves as the visual encoder, the text-encoding part of a CLIP model as the text encoder, a ViT model performs the multi-modal fusion, and the decoding part of a DeepLabv3+ model serves as the decoder, in order to calculate the semantic segmentation loss functions of the teacher model and the student model; the input prompt text t is provided, and ablation experiments and omni-supervised training are carried out.
In step 1, the hyper-parameter λ is set to 1.
In step 1, the input text t, in which each word needs to carry positive/negative polarity information fed to the text encoders of the teacher model and the student model, is defined as follows:

t = {(w_1, p_1), (w_2, p_2), …, (w_T, p_T)}

where p_i denotes the positive/negative polarity of the input word w_i, and w_i denotes a word of the input vocabulary.
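As a minimal illustration of this input format, the text t can be represented as a list of (word, polarity) pairs. The helper below is a hypothetical sketch for illustration only; its name and the choice of 1/0 polarity flags are assumptions, not part of the patent:

```python
# Hypothetical sketch of the input text t = {(w_i, p_i)}: each input word
# w_i is paired with a polarity flag p_i (1 = positive, 0 = negative).
def build_text_input(words, positive_words):
    """Pair each word with its positive/negative polarity flag p_i."""
    return [(w, 1 if w in positive_words else 0) for w in words]

t = build_text_input(["the", "left", "dog"], positive_words={"left", "dog"})
# t == [("the", 0), ("left", 1), ("dog", 1)]
```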
In step 1, the manually annotated omni-labels of images include points, scribbles, and boxes.
Preferably, in step 1, the process by which the active pseudo-label screening method screens point pseudo labels is defined as follows:

ŷ_point = { r ∈ R : c ∈ r }

where ŷ_point denotes the pseudo label selected by the point label, c denotes the coordinate information of the point label, and R denotes the set of semantic segmentation regions before screening; the intersection of the point label with a semantic segmentation region selects that region as the result of pseudo-label screening.
Preferably, in step 1, the process by which the active pseudo-label screening method screens scribble pseudo labels is defined as follows:

ŷ_scribble = ∪ { r ∈ R : S ∩ r ≠ ∅ }

where ŷ_scribble denotes the pseudo label selected by the scribble label, S denotes the set of pixels occupied by the scribble label, and R denotes the set of semantic segmentation regions before screening; the union of the regions overlapping the scribble label is selected as the result of pseudo-label screening.
Preferably, in step 1, the process by which the active pseudo-label screening method screens box pseudo labels is defined as follows:

ŷ_box = { r ∈ R : r lies entirely inside the box B and |r| / (w · h) > τ }

where ŷ_box denotes the pseudo label selected by the box label, B denotes the box label information (a pixel's box value is 0 outside the box and 1 inside it), R denotes the set of semantic segmentation regions before screening, w and h denote the width and height of the box, and τ denotes a preset threshold; a semantic segmentation region existing only within the target box is selected as a pseudo-label screening result when the ratio of the pixels it occupies to the pixels occupied by the box exceeds the threshold.
In the training process of the teacher-student model framework, an Adam optimizer is used, the initial learning rate is set to 0.0001, the update coefficient α is set to 0.9996, and the positive and negative thresholds on the teacher model's output probabilities are set to 0.7 and 0.3, respectively.
After the technical scheme is adopted, the invention has the following technical effects:
On the basis of a semantic segmentation model, on the one hand, by incorporating a multi-modal model, the invention supports executing segmentation tasks through input prompt text, such as segmenting specific targets in an image by entering target names, numbers, or direction words; this overcomes the limitation that traditional semantic segmentation can only segment targets within a fixed set of categories, and adds text-prompted target localization on top of traditional semantic segmentation. On the other hand, by incorporating an omni-supervised learning method, the existing semi-supervised framework is improved so that various image labels, including point, scribble, box, and semantic segmentation labels, can be used to train the segmentation model; this reduces the manual cost of semantic segmentation annotation, achieves higher performance than semi-supervised learning, and improves the generalization of the segmentation model. In addition, an active pseudo-label screening method is proposed, which addresses the low quality of pseudo labels produced by traditional pseudo-label acquisition methods and reduces the probability of the model over-fitting during iteration.
Drawings
FIG. 1 is a schematic diagram of a frame according to an embodiment of the present invention.
Detailed Description
It should be noted that the concept of prompt text (Prompt) originated in the field of natural language processing and is now widely used in the multi-modal field. In visual-language tasks, entered prompt words help a multi-modal model understand image content, for example identifying objects in an image. The concept of omni-supervision (Omni-supervision) was first proposed for the UFO² model, an object detection model based on the Faster R-CNN framework; it treats omni-supervised learning as a more general, enhanced form of semi-supervised learning. On top of the semi-supervised use of unlabeled images, it mixes all available image labels (such as point, scribble, box, and semantic segmentation labels, collectively called Omni-Labels) to train a vision model, thereby using cheaper labels to reduce manual annotation cost and achieving better performance than semi-supervised learning.
In order to further explain the technical scheme of the invention, the invention is explained in detail by specific examples.
Referring to FIG. 1, the invention discloses an omni-supervised semantic segmentation method based on prompt text, comprising a model implementation process and a model training process.
1. Model implementation process
1.1) Input an image carrying a semantic segmentation label, together with an input text composed of character strings, into the student model to obtain the student model's semantic segmentation result, and calculate the fully supervised loss L_sup between that result and the semantic segmentation label; the detailed process is described in 1.1.1-1.1.4:
1.1.1) The input image is strongly augmented (e.g., Gaussian blur, color jitter) and input to the student model together with the input text; the input image and input text are encoded by the student model's visual encoder and text encoder respectively, outputting feature matrices (lower left of FIG. 1):

X = E_v(x),  T = E_t(t)

where X denotes the image feature matrix of dimension H × W × C_x output by the visual encoder model; E_v denotes the visual encoder model; x denotes the strongly augmented input image of dimension H × W × 3 (H, W, and C_x denote the height, width, and channel number of the image, and 3 indicates the image consists of the three primary colors red, green, and blue); T denotes the text feature matrix of dimension T × C_t (T and C_t denote the length and channel number of the text feature matrix); both feature matrices are input to the multi-modal fusion model; E_t denotes the text encoder model and t denotes the input text;
1.1.2) Perform multi-modal fusion of the image feature matrix and the text feature matrix (lower left of FIG. 1):

X′ = Proj(X),  T′ = Proj(T)
F = M(X′, T′)

where X′ and T′ denote the results of passing the image and text feature matrices through linear projection; Proj denotes a linear projection layer whose purpose is to keep the channel numbers of the image feature matrix and the text feature matrix input to the multi-modal model consistent, i.e., C_x = C_t; F denotes the multi-modal fused feature matrix of dimension H × W × C (H, W, and C denote height, width, and channel number), which serves as the input of the decoder model; M denotes the multi-modal fusion model;
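The shapes flowing through steps 1.1.1-1.1.2 can be sketched at the array level as below. The linear projection aligning the image channels C_x with the text channels C_t is from the patent; the fusion internals shown here (a simple attention-style product) are an assumption of this sketch, since the patent only names a ViT model for that stage:

```python
import numpy as np

# Shape-level sketch of projection + multi-modal fusion (assumed internals).
H, W, C_x, T_len, C_t = 4, 4, 8, 5, 16

X = np.random.rand(H * W, C_x)   # image features from the visual encoder
T = np.random.rand(T_len, C_t)   # text features from the text encoder

proj = np.random.rand(C_x, C_t)  # linear projection layer
Xp = X @ proj                    # (H*W, C_t): channel numbers now consistent

attn = Xp @ T.T                  # (H*W, T_len) image-text affinity (assumption)
F = attn @ T                     # (H*W, C_t) fused features, decoder input
```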
1.1.3) Input the multi-modal fused feature matrix into the decoder model to obtain the target segmentation result of the input image, i.e., the student model's semantic segmentation result (lower left of FIG. 1):

p_s = D(F; θ_s)

where p_s denotes the semantic segmentation result of dimension H × W output by the model, θ_s denotes the weights of the student model, and D denotes the decoder model;
1.1.4) Calculate the fully supervised loss between the student model's semantic segmentation result and the semantic segmentation label (lower left of FIG. 1):

L_sup = ℓ(p_s, y)

where y denotes the semantic segmentation label of dimension H × W.
1.2) Input an image without semantic segmentation labels, together with the input text, into both the teacher model and the student model to obtain the segmentation results of both models, and calculate the omni-supervised loss L_omni; the detailed process is as follows:
1.2.1) The input image without semantic segmentation labels undergoes strong and weak augmentation simultaneously and is input, together with the input text, into the teacher model, finally yielding the teacher model's semantic segmentation result p_t (upper left of FIG. 1); the calculation of p_t is identical to that of p_s except that the teacher model's weights θ_t are used in place of θ_s.
1.2.2) After pseudo-label screening and filtering of the semantic segmentation result p_t obtained in 1.2.1 (detailed in 1.3), calculate the loss of the omni-supervised part:

L_omni = ℓ(p_s, ŷ)

where ŷ denotes the pseudo label of dimension H × W obtained by screening and filtering the teacher model's semantic segmentation result.
1.3) The semantic segmentation result output by the teacher model is screened and filtered using the manually annotated omni-labels of the image (e.g., points, scribbles, and boxes; upper right of FIG. 1), finally generating pseudo labels for training the student model:

ŷ = A(G, p_t)

where G denotes the manually annotated omni-label of the image, p_t denotes the teacher model's semantic segmentation result, and A denotes the active pseudo-label screening method (right of FIG. 1), which, in the manner of Active Learning, actively screens the pseudo labels that should participate in training using information such as label positions and thresholds, and continuously eliminates low-quality pseudo labels during iteration, thereby improving the quality of the screened pseudo labels and reducing the probability of the model over-fitting during iteration.
1.3.1) In A, the formula by which the active pseudo-label screening method screens point pseudo labels is defined as follows:

ŷ_point = { r ∈ R : c ∈ r }

where ŷ_point denotes the pseudo label selected by the point label, c denotes the coordinate information of the point label, and R denotes the set of semantic segmentation regions before screening; the intersection of the point label with a semantic segmentation region selects that region as the result of pseudo-label screening.
1.3.2) In A, the formula by which the active pseudo-label screening method screens scribble pseudo labels is defined as follows:

ŷ_scribble = ∪ { r ∈ R : S ∩ r ≠ ∅ }

where ŷ_scribble denotes the pseudo label selected by the scribble label, S denotes the set of pixels occupied by the scribble label, and R denotes the set of semantic segmentation regions before screening; the union of the regions overlapping the scribble label is selected as the result of pseudo-label screening.
1.3.3) In A, the formula by which the active pseudo-label screening method screens box pseudo labels is defined as follows:

ŷ_box = { r ∈ R : r lies entirely inside the box B and |r| / (w · h) > τ }

where ŷ_box denotes the pseudo label selected by the box label, B denotes the box label information (a pixel's box value is 0 outside the box and 1 inside it), R denotes the set of semantic segmentation regions before screening, w and h denote the width and height of the box, and τ denotes a preset threshold defaulting to 0.2; a semantic segmentation region existing only within the target box is selected as a pseudo-label screening result when the ratio of the pixels it occupies to the pixels occupied by the box exceeds the threshold.
1.4) For a specific semantic segmentation task, calculate its loss function using the results of 1.1 and 1.2 (lower right of FIG. 1):

L = L_sup + λ · L_omni

where λ denotes the hyper-parameter adjusting the weight of the omni-supervised loss.
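A minimal numerical sketch of combining the supervised and omni-supervised parts is given below, using per-pixel binary cross-entropy as a stand-in for the segmentation loss ℓ, whose exact form the patent does not specify:

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Per-pixel binary cross-entropy, a stand-in for the loss l."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean())

def total_loss(p_student, y_label, y_pseudo, lam=1.0):
    l_sup = bce(p_student, y_label)    # fully supervised part (1.1.4)
    l_omni = bce(p_student, y_pseudo)  # omni-supervised part (1.2.2)
    return l_sup + lam * l_omni        # L = L_sup + lambda * L_omni

p = np.full((2, 2), 0.5)
loss = total_loss(p, np.ones((2, 2)), np.ones((2, 2)))  # 2 * ln 2 ≈ 1.386
```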
1.5) Update the weights of the teacher model through an exponential moving average (EMA) algorithm (left of FIG. 1):

θ_t^(k) = α · θ_t^(k−1) + (1 − α) · θ_s^(k)

where θ_t^(k) denotes the weights of the teacher model at the k-th iteration, θ_s^(k) denotes the weights of the student model at the k-th iteration, and α denotes the update coefficient.
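The EMA update in step 1.5 can be sketched in a few lines; plain Python lists stand in for real parameter tensors, and α = 0.9996 is the patent's update coefficient:

```python
# theta_t_k = alpha * theta_t_{k-1} + (1 - alpha) * theta_s_k
def ema_update(teacher, student, alpha=0.9996):
    """One exponential-moving-average step over paired weight values."""
    return [alpha * wt + (1 - alpha) * ws for wt, ws in zip(teacher, student)]

teacher = ema_update([1.0, 0.0], [0.0, 1.0])  # -> [0.9996, 0.0004]
```

The large α keeps the teacher a slowly-moving average of past student weights, which stabilizes the pseudo labels it produces across iterations.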
2. Model training process:
2.1 training model:
A ResNet model is used as the visual encoder, the text-encoding part of a CLIP model as the text encoder, and the decoding part of a DeepLabv3+ model as the decoder, to calculate the loss functions in the teacher-student framework of the prompt-text-based omni-supervised semantic segmentation method; the three datasets Pascal VOC 2012, Cityscapes, and MS COCO provide the entered prompt text t for ablation experiments and omni-supervised training. Pascal VOC 2012 contains 10,582 training samples; Cityscapes contains 2,975 high-resolution training samples and 500 validation samples; MS COCO contains about 118,000 training samples and about 5,000 validation samples. The prompt-text data is randomly generated from statistics of the number, direction, and similar information of the image targets in each dataset.
2.2 Model training parameter setting:
During training we use the Adam optimizer with an initial learning rate of 0.0001, set the update coefficient α to 0.9996, set the positive and negative thresholds on the probabilities of the teacher model's segmentation output to 0.7 and 0.3 respectively, set the hyper-parameter λ weighting the omni-supervised loss to 1, set the batch size to 64, and train for 40 iteration rounds. Among the model hyper-parameters, the image height H and width W both default to 480, the channel number C defaults to 768, and the text length T defaults to 40.
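The positive/negative thresholding of the teacher's output probabilities (0.7 and 0.3 above) can be sketched as below; the -1 "ignore" value for uncertain pixels is an assumption of this sketch:

```python
import numpy as np

def threshold_teacher(prob, pos=0.7, neg=0.3):
    """Turn teacher probabilities into {1, 0, -1} pseudo-label pixels:
    confident foreground, confident background, or ignored."""
    label = np.full(prob.shape, -1, dtype=int)  # -1 = ignored pixel (assumption)
    label[prob >= pos] = 1                      # above positive threshold
    label[prob <= neg] = 0                      # below negative threshold
    return label

labels = threshold_teacher(np.array([0.9, 0.5, 0.1]))  # -> [1, -1, 0]
```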
The above embodiments and drawings do not limit the product form or layout of the present invention; any suitable variation or modification made by those skilled in the art should be regarded as not departing from the scope of the invention.
Claims (10)
1. A prompt-text-based omni-supervised semantic segmentation method, built on a teacher-student model framework from the semi-supervised computer-vision direction, comprising a teacher model and a student model, with the training model supervised by manually annotated omni-labels of images, characterized by comprising the following steps:
step 1: calculate the loss function of the teacher-student model framework under omni-supervision:

L = L_sup + λ · L_omni

where L_sup denotes the fully supervised loss between the semantic segmentation result output by the student model and the semantic segmentation label, L_omni denotes the omni-supervised loss between the semantic segmentation results output by the student model and the teacher model, and λ denotes the hyper-parameter adjusting the weight of the omni-supervised loss;

L_sup is calculated as

L_sup = ℓ(p_s, y),  with p_s = f(x, t; θ_s)

L_omni is calculated as

L_omni = ℓ(p_s, ŷ)

where θ_s denotes the weights of the student model, x denotes an input image with a semantic segmentation label, t denotes the input text composed of character strings, y denotes the semantic segmentation label of the input image, p_s denotes the semantic segmentation result output by the student model, and ŷ denotes the pseudo label obtained by screening and filtering the semantic segmentation result output by the teacher model; ŷ is calculated as

ŷ = A(G, p_t)

where G denotes the manually annotated omni-label of the image, p_t denotes the semantic segmentation result output by the teacher model, and A denotes the active pseudo-label screening method;

step 2: update the weights of the teacher model through an exponential moving average algorithm:

θ_t^(k) = α · θ_t^(k−1) + (1 − α) · θ_s^(k)

where θ_t^(k) denotes the weights of the teacher model at the k-th iteration, θ_s^(k) denotes the weights of the student model at the k-th iteration, and α denotes the update coefficient.
2. The hint text-based omnidirectional supervised semantic segmentation method as set forth in claim 1, wherein:
in step 1, p_s is calculated by the following chain of formulas:

p_s = D(F)
F = M(X′, T′)
X′ = Proj(X),  T′ = Proj(T)
X = E_v(x),  T = E_t(t)

where D denotes the decoder model; F denotes the feature matrix after multi-modal fusion, which serves as the input of the decoder model; M denotes the multi-modal fusion model; X′ and T′ denote the results of passing the image feature matrix and the text feature matrix through linear projection, which serve as the inputs of the multi-modal fusion model; Proj denotes a linear projection layer whose purpose is to keep the channel numbers of X′ and T′ consistent; X denotes the image feature matrix output by the visual encoder model; T denotes the text feature matrix output by the text encoder model; E_v denotes the visual encoder model; and E_t denotes the text encoder model;

the calculation of p_t is identical to that of p_s except that the teacher model's weights θ_t are used in place of θ_s.
3. The hint text-based omnidirectional supervised semantic segmentation method as set forth in claim 1, wherein:
in step 1, a ResNet model serves as the visual encoder, the text-encoding part of a CLIP model as the text encoder, a ViT model performs the multi-modal fusion, and the decoding part of a DeepLabv3+ model serves as the decoder, in order to calculate the semantic segmentation loss functions of the teacher model and the student model; the input prompt text t is provided, and ablation experiments and omni-supervised training are carried out.
4. The hint text-based omnidirectional supervised semantic segmentation method as set forth in claim 1, wherein:
in step 1, the hyper-parameter λ is set to 1.
5. The hint text-based omnidirectional supervised semantic segmentation method as set forth in claim 1, wherein:
in step 1, the input text t, in which each word needs to carry positive/negative polarity information fed to the text encoders of the teacher model and the student model, is defined as follows:

t = {(w_1, p_1), (w_2, p_2), …, (w_T, p_T)}

where p_i denotes the positive/negative polarity of the input word w_i, and w_i denotes a word of the input vocabulary.
6. The hint text-based omnidirectional supervised semantic segmentation method as set forth in claim 1, wherein:
in step 1, the manually annotated omni-labels of images include points, scribbles, and boxes.
7. The hint text-based omnidirectional supervised semantic segmentation method of claim 6, wherein:
in step 1, the process by which the active pseudo-label screening method screens point pseudo labels is defined as follows:

ŷ_point = { r ∈ R : c ∈ r }

where ŷ_point denotes the pseudo label selected by the point label, c denotes the coordinate information of the point label, and R denotes the set of semantic segmentation regions before screening; the intersection of the point label with a semantic segmentation region selects that region as the result of pseudo-label screening.
8. The hint text-based omnidirectional supervised semantic segmentation method of claim 6, wherein:
in step 1, the process by which the active pseudo-label screening method screens scribble pseudo labels is defined as follows:

ŷ_scribble = ∪ { r ∈ R : S ∩ r ≠ ∅ }

where ŷ_scribble denotes the pseudo label selected by the scribble label, S denotes the set of pixels occupied by the scribble label, and R denotes the set of semantic segmentation regions before screening; the union of the regions overlapping the scribble label is selected as the result of pseudo-label screening.
9. The hint text-based omnidirectional supervised semantic segmentation method of claim 6, wherein:
in step 1, the process by which the active pseudo-label screening method screens box pseudo labels is defined as follows:

ŷ_box = { r ∈ R : r lies entirely inside the box B and |r| / (w · h) > τ }

where ŷ_box denotes the pseudo label selected by the box label, B denotes the box label information (a pixel's box value is 0 outside the box and 1 inside it), R denotes the set of semantic segmentation regions before screening, w and h denote the width and height of the box, and τ denotes a preset threshold; a semantic segmentation region existing only within the target box is selected as a pseudo-label screening result when the ratio of the pixels it occupies to the pixels occupied by the box exceeds the threshold.
10. The prompt-text-based omnidirectional supervised semantic segmentation method of claim 1, wherein:
during training of the teacher-student model framework, an Adam optimizer is used with an initial learning rate of 0.0001, the update coefficient is set to 0.9996, and the positive and negative thresholds on the teacher model's output probability are set to 0.7 and 0.3, respectively.
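The hyperparameters in claim 10 can be sketched as follows. Treating the update coefficient as an exponential-moving-average weight on the teacher parameters is an assumption (the claim does not spell out the update rule), and the flat parameter dictionaries are simplified stand-ins for real model weights:

```python
def update_teacher(teacher, student, alpha=0.9996):
    """Move teacher weights toward student weights with coefficient alpha:
    teacher <- alpha * teacher + (1 - alpha) * student."""
    return {k: alpha * teacher[k] + (1.0 - alpha) * student[k] for k in teacher}

def pseudo_label_from_prob(prob, pos=0.7, neg=0.3):
    """Threshold a teacher output probability into a pseudo label.

    Returns 1 for a confident positive, 0 for a confident negative,
    and None for uncertain pixels that should be ignored in the loss.
    """
    if prob > pos:
        return 1
    if prob < neg:
        return 0
    return None
```

With alpha = 0.9996 the teacher changes very slowly, which is what makes its output probabilities stable enough to threshold at 0.7 / 0.3.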
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410239251.9A CN117830638A (en) | 2024-03-04 | 2024-03-04 | Omnidirectional supervision semantic segmentation method based on prompt text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117830638A true CN117830638A (en) | 2024-04-05 |
Family
ID=90523146
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410239251.9A Pending CN117830638A (en) | 2024-03-04 | 2024-03-04 | Omnidirectional supervision semantic segmentation method based on prompt text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117830638A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114140390A (en) * | 2021-11-02 | 2022-03-04 | 广州大学 | Crack detection method and device based on semi-supervised semantic segmentation |
US20220156593A1 (en) * | 2020-11-16 | 2022-05-19 | Salesforce.Com, Inc. | Systems and methods for video representation learning with a weak teacher |
US20230093619A1 (en) * | 2021-09-17 | 2023-03-23 | Uif (University Industry Foundation), Yonsei University | Weakly supervised semantic segmentation device and method based on pseudo-masks |
CN115861164A (en) * | 2022-09-16 | 2023-03-28 | 重庆邮电大学 | Medical image segmentation method based on multi-field semi-supervision |
CN116993975A (en) * | 2023-07-11 | 2023-11-03 | 复旦大学 | Panoramic camera semantic segmentation method based on deep learning unsupervised field adaptation |
CN117058024A (en) * | 2023-08-04 | 2023-11-14 | 淮阴工学院 | Transformer-based efficient defogging semantic segmentation method and application thereof |
CN117237648A (en) * | 2023-11-16 | 2023-12-15 | 中国农业科学院农业资源与农业区划研究所 | Training method, device and equipment of semantic segmentation model based on context awareness |
Non-Patent Citations (2)
Title |
---|
Jiamu Sun et al.: "RefTeacher: A Strong Baseline for Semi-Supervised Referring Expression Comprehension", 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22 August 2023, pages 19144-19151 * |
Teng Guolong: "Research and Application of Real-Time Semantic Segmentation Algorithms Based on Semi-Supervised Learning", China Master's Theses Full-text Database, no. 02, 15 February 2023, pages 20-72 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110097131B (en) | Semi-supervised medical image segmentation method based on countermeasure cooperative training | |
Jiang et al. | Scfont: Structure-guided chinese font generation via deep stacked networks | |
Dvornik et al. | On the importance of visual context for data augmentation in scene understanding | |
CN109299274B (en) | Natural scene text detection method based on full convolution neural network | |
CN112887698B (en) | High-quality face voice driving method based on nerve radiation field | |
CN111738251B (en) | Optical character recognition method and device fused with language model and electronic equipment | |
CN108874174A (en) | A kind of text error correction method, device and relevant device | |
CN111723585A (en) | Style-controllable image text real-time translation and conversion method | |
CN113158862B (en) | Multitasking-based lightweight real-time face detection method | |
CN107251059A (en) | Sparse reasoning module for deep learning | |
CN110533024B (en) | Double-quadratic pooling fine-grained image classification method based on multi-scale ROI (region of interest) features | |
CN109086768B (en) | Semantic image segmentation method of convolutional neural network | |
CN111737511B (en) | Image description method based on self-adaptive local concept embedding | |
CN111160533A (en) | Neural network acceleration method based on cross-resolution knowledge distillation | |
Zhang et al. | Efficient inductive vision transformer for oriented object detection in remote sensing imagery | |
CN113673338B (en) | Automatic labeling method, system and medium for weak supervision of natural scene text image character pixels | |
CN110880176B (en) | Semi-supervised industrial image defect segmentation method based on countermeasure generation network | |
CN112070114B (en) | Scene character recognition method and system based on Gaussian constraint attention mechanism network | |
CN113807340B (en) | Attention mechanism-based irregular natural scene text recognition method | |
CN114565808B (en) | Double-action contrast learning method for unsupervised visual representation | |
CN111914555A (en) | Automatic relation extraction system based on Transformer structure | |
CN113378949A (en) | Dual-generation confrontation learning method based on capsule network and mixed attention | |
CN115718815A (en) | Cross-modal retrieval method and system | |
Qu et al. | Exploring stroke-level modifications for scene text editing | |
CN111739037A (en) | Semantic segmentation method for indoor scene RGB-D image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||