CN117690031B - SAM model-based small sample learning remote sensing image detection method - Google Patents
- Publication number: CN117690031B
- Application number: CN202410154598.3A
- Authority: CN (China)
- Legal status: Active (an assumption by Google Patents, not a legal conclusion; no legal analysis has been performed)
Classifications
- G06V20/10 — Scenes; scene-specific elements: terrestrial scenes
- G06V10/26 — Image preprocessing: segmentation of patterns in the image field
- G06V10/764 — Recognition using pattern recognition or machine learning: classification
- G06V10/774 — Processing image features in feature spaces: generating sets of training patterns
- Y02D10/00 — Climate change mitigation in ICT: energy efficient computing
Abstract
The invention discloses a SAM-model-based small-sample learning method for remote sensing image detection, comprising the following steps: S1, establishing an object sample library; S2, obtaining an initial labeling result and an initial semantic label of the detection target with the Grounding DINO model, and acquiring image features with the SAM model; S3, extracting comprehensive feature points, superposing the image features, and letting the detection model obtain the extraction result under the guidance of the sample library. The invention uses the SAM and Grounding DINO models to preprocess a small amount of labeled-sample knowledge from remote sensing images as feature samples and combines these with the SAM model's segment-anything recognition capability, giving strong structural extensibility. A preliminary candidate region is obtained from the SAM model's generalized segmentation capability, the two general target detection models provide sufficiently generalized basic features for subsequent processing, and the generalization capability of the detection model is improved.
Description
Technical Field
The invention relates to the technical field of cross-modal detection of remote sensing images, and in particular to a SAM-model-based method for small-sample learning remote sensing image detection.
Background
Remote sensing image target detection is an image detection task for a specific application scene. Most existing remote sensing target detection models are based on backbones such as ResNet, Vision Transformer (ViT), or Swin Transformer and, after extensive supervised training on remote sensing samples, form a relatively fixed recognition model whose detection effect is limited by the number and labeling quality of those samples. Because of complex background interference and the ambiguity and diversity of object edges in remote sensing scenes, long-tail cases are common, large amounts of labeled data are needed, and training costs are high. The generalization ability of traditional models is also mediocre: a supervised model can only detect the target types present in its training samples, and recognizing a newly added target type requires labeling new samples and retraining, a long iteration process; moreover, no semantic association is established between different recognized types.
General target detection models, represented by the SAM model (Segment Anything Model), are expected to alleviate the long-tail problem of visual target detection and reduce the development cost of customized recognition models. However, current general target detection models still have the following shortcomings. First, the SAM model is not a fully unsupervised detection model; it relies on manually entered prompt information, such as point, box, or region prompts. Second, segmentation of new samples in a new field lacks the guidance of domain image knowledge, so the segmentation granularity is too fine or too coarse and the complete boundary of a domain object cannot be accurately delineated. Third, the SAM model's segmentation result carries no semantic tag and cannot indicate what the segmented image object represents. To improve remote sensing detection using the generalized recognition capability of a general target detection model, sample knowledge from the remote sensing subdivision field must be incorporated appropriately.
Existing approaches to improving the generalization of remote sensing target detection based on the general target detection SAM model include the following:
One is to directly train generic text-image cross-modal target detection. Models such as CLIP and Grounding DINO are trained directly on text-image multimodal data with the goal of open-set detection: any detection target can be specified by the input text. The diversity of text descriptions gives such models strong input generalization, but as general cross-modal detectors their strength lies in text-to-image modality conversion; the narrower capability of target detection is not prominent enough, and detection accuracy on remote sensing images is unsatisfactory, sometimes not even reaching the level of traditional models.
Secondly, connecting the image features extracted by the SAM model encoder to a remote sensing image target detection and classification module. This scheme mainly exploits the strong edge extraction capability of the SAM model, then uses remote sensing image samples to train an independent target discrimination and label classification module for the remote sensing field, combining the SAM model's extraction results with remote sensing sample knowledge.
Thirdly, having a remote sensing image target detection module extract candidate bounding boxes and feed them to the SAM model as prompt information. This scheme starts from the input side of the SAM model: a training module for remote sensing samples is placed in front, and a traditional target detection model runs first.
These schemes improve, to some extent, the detection effect of a general target detection model applied to remote sensing imagery, but obvious defects remain. In the first scheme, although any recognition target can be specified through the text modality and input generalization is sufficient, remote sensing image knowledge cannot be introduced effectively and recognition precision is poor. The second and third schemes each introduce supervised training on remote sensing samples through a separate module, which improves recognition accuracy but leaves input generalization insufficient; combining the SAM model with a supervised model introduces the knowledge contained in labeled remote sensing samples but restricts the recognizable target types. Such a model can still only detect the target types present in its training samples; recognizing a new target type requires retraining the supervised module, so the generalization capability of the general target detection model cannot be fully exploited.
These two points are the main defects of existing schemes applying general target detection models to remote sensing imagery, and they limit the accuracy and generalization of current remote sensing target detection.
Patent document CN116824133A discloses an intelligent interpretation method for remote sensing images in the field of remote sensing image processing, comprising: an encoder module, a vector-quantized variational autoencoder used to learn hidden-layer feature representations; training the encoder module with a first training set and jointly training a central-area module and a decoder module with the first and second training sets; and acquiring a remote sensing image to be segmented and feeding it to the trained semantic segmentation model to obtain a semantic segmentation result. Compared with the original SAM model, this supervised machine-learning detection model improves accuracy on remote sensing images to some degree. But it simultaneously suffers from insufficient generalization: the target types of the supervised model are relatively fixed, and the effect degrades when the test image differs greatly from the training samples.
Disclosure of Invention
The invention aims to provide a SAM-model-based small-sample learning method for remote sensing image detection that improves both the generalization and the accuracy of existing remote sensing target detection.
The aim of the invention is achieved by the following technical scheme: a SAM-model-based method for small-sample learning remote sensing image detection, comprising the following steps:
S1, establishing an object sample library and extracting the sample labeling and semantic-tag features of each sample;
S2, the user inputs a remote sensing image and a text description of the detection target to the detection model; the detection model preprocesses them with the Grounding DINO model to obtain the initial labeling result and initial semantic label of the detection target, and preprocesses the remote sensing image with the SAM model to obtain image features;
S3, the detection model compares the initial labeling result and initial semantic label of the detection target with the sample labelings and semantic labels in the sample library, extracts comprehensive feature points, and then superposes the image features obtained by the SAM model as a further prompt signal for SAM target detection, so that the detection model obtains the extraction result under the guidance of the sample library.
Further: in step S2, the Grounding DINO model preprocesses the input remote sensing image and the detection-target text description as follows:

F_img-dino = Φ_dino-img-enc(I) (1)

F_text = Φ_dino-text-enc(text) (2)

(M_dino, Label_dino) = Φ_dino-dec(F_img-dino, F_text) (3)

where I is the remote sensing image input by the user, text is the detection-target text description, Φ_dino-img-enc is the image encoder of Grounding DINO with output F_img-dino, Φ_dino-text-enc is the text encoder of Grounding DINO with output F_text, Φ_dino-dec is the decoder of Grounding DINO, M_dino is the initial labeling result of the detection target, and Label_dino is its initial semantic label.
Further: in step S2 the SAM model preprocesses the input remote sensing image as follows:

F_img = Φ_i-enc(I) (4)

where Φ_i-enc is the SAM model's pre-trained image encoder, I is the user-input remote sensing image, and F_img is the resulting image feature.
Further: in step S3, the comprehensive feature points are extracted as follows:
S3A1, spatially multiply the sample labeling M_R with the sample image feature F_R to obtain their distribution similarity F̄_R;
S3A2, compute the cosine similarity between the distribution similarity F̄_R and the image feature F to obtain the probability S that each pixel of the remote sensing image matches the target labeling;
S3A3, from the probability S, obtain the highest-confidence point P_high and the lowest-confidence point P_low of the input remote sensing image;
S3A4, extract the comprehensive feature points from P_high and P_low with the SAM model's prompt encoder Φ_p-enc:

T_prompt = Φ_p-enc(P_high, P_low) (5)

where Φ_p-enc is the prompt encoder of the SAM model and T_prompt is the comprehensive feature point.
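As a concrete illustration of step S3A3, the sketch below picks the highest- and lowest-confidence pixels from a toy probability map S with numpy; the function name and the toy values are illustrative and are not part of the patent.

```python
import numpy as np

def extract_prompt_points(S):
    """Step S3A3: P_high is the likeliest target pixel in probability
    map S, P_low the likeliest background pixel (function name ours)."""
    p_high = np.unravel_index(np.argmax(S), S.shape)  # positive prompt point
    p_low = np.unravel_index(np.argmin(S), S.shape)   # negative prompt point
    return p_high, p_low

S = np.array([[0.10, 0.20, 0.10, 0.00],
              [0.20, 0.95, 0.30, 0.10],
              [0.10, 0.30, 0.20, 0.10],
              [0.00, 0.10, 0.10, 0.05]])  # toy probability map
p_high, p_low = extract_prompt_points(S)
print(p_high, p_low)
```

In the patent these two pixel coordinates would then be passed to Φ_p-enc as positive and negative point prompts.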
Further: in step S3, the image features are superposed as a further prompt signal for SAM target detection as follows:
S3B1, from the initial labeling results M_dino obtained by the Grounding DINO model, compute the correlation between each result's initial semantic label Label_dino and the sample label Label_R, and take the M_dino whose label Label_M has the highest correlation:

Label_M = argmax sim(Label_dino, Label_R) (6)

where sim is the Chinese word-vector correlation calculation;
S3B2, extract T_dino from M_dino with the SAM model's prompt encoder:

T_dino = Φ_p-enc(M_dino) (7)

S3B3, superpose T_dino and the comprehensive feature point T_prompt and input them together, as the SAM model's prompt, into the decoder Φ_m-dec for decoding, improving the detection accuracy of the SAM model:

M_result = Φ_m-dec(F_img, T_dino, T_prompt, T_out) (8)

where T_out is a fixed prompt hyperparameter of the SAM model for boundary extraction;
S3B4, repeat steps S3B1 to S3B3 N times to obtain the final output of the detection model:

Output = ∪_{r=1}^{N} Φ_m-dec(F_img, T_dino^(r), T_prompt^(r), T_out) (9)

where r ∈ [1, N], N is a positive integer equal to the number of custom samples, T_dino^(r) is the value of T_dino at the r-th iteration, and T_prompt^(r) is the value of T_prompt at the r-th iteration.
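The iteration in steps S3B1 to S3B4 can be sketched as follows. Φ_p-enc and Φ_m-dec are replaced by trivial stand-ins, since the real SAM prompt encoder and mask decoder are not reproduced here; only the per-sample loop and the prompt superposition structure follow the text.

```python
import numpy as np

def phi_p_enc(prompt):                        # stand-in for Φ_p-enc
    return np.asarray(prompt, dtype=float)

def phi_m_dec(f_img, tokens):                 # stand-in for Φ_m-dec
    return f_img.mean() + sum(t.sum() for t in tokens)

def detect(f_img, samples, t_out=np.zeros(1)):
    """One decode per custom sample, results collected as in eq. (9)."""
    results = []
    for m_dino, t_prompt in samples:          # r = 1..N
        t_dino = phi_p_enc(m_dino)            # eq. (7)
        # superpose T_dino, T_prompt and the fixed T_out token (eq. (8))
        results.append(phi_m_dec(f_img, [t_dino, t_prompt, t_out]))
    return results

f_img = np.ones((4, 4))
samples = [([1.0, 2.0], np.array([0.5])), ([0.0, 1.0], np.array([1.5]))]
out = detect(f_img, samples)
print(len(out))
```

With real SAM components, each loop iteration would produce one candidate mask, and the N results would be merged into the final output.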
Further: the probability S is computed over the image's multiple channels as follows:

F̄_R = M_R ⊙ F_R (10)

S_i = cosine(F̄_R, F_i) (11)

S = (1/n) Σ_{i=1}^{n} S_i (12)

where i ∈ [1, n] indexes the image channels, F̄_R is the distribution similarity, M_R is the sample labeling, F_R is the sample image feature, F_i is the i-th channel of the image feature, S_i is the result value of the i-th channel, and S is the mean result over the channels.
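A minimal numpy reading of eqs. (10) to (12). Note that the text describes S both as a per-pixel probability and as a per-channel mean; this sketch follows the equations literally, producing one averaged score per sample. The toy mask and feature arrays are invented for illustration.

```python
import numpy as np

def target_probability(M_R, F_R, F):
    """Literal reading of eqs. (10)-(12): spatial product of sample mask
    and sample features, per-channel cosine against the input features,
    then the mean over the n channels."""
    n = F.shape[0]
    F_bar = M_R * F_R                      # eq. (10): distribution similarity
    S_i = np.empty(n)
    for i in range(n):                     # eq. (11): cosine per channel
        a, b = F_bar[i].ravel(), F[i].ravel()
        S_i[i] = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return S_i.mean()                      # eq. (12): mean over channels

rng = np.random.default_rng(0)
M_R = np.zeros((1, 4, 4)); M_R[0, 1:3, 1:3] = 1.0  # toy sample labeling
F_R = rng.random((3, 4, 4))                         # toy sample features
S = target_probability(M_R, F_R, F_R.copy())        # image features = F_R here
print(0.0 < S <= 1.0)  # → True
```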
Further: the object sample library in S1 is established in a manner that includes user-defined sample addition.
The invention has the following beneficial effects:
1. Through the combined use of the general target detection SAM model and the text-image feature conversion model Grounding DINO, the invention obtains text-to-image conversion features from Grounding DINO and a preliminary candidate region from the SAM model's generalized segmentation capability, while preprocessing a small amount of labeled-sample knowledge from remote sensing images as feature samples to combine with the SAM model's segment-anything recognition ability. On one hand this remedies the shortcomings of applying the SAM model in the remote sensing and geographic imaging field and adds semantic labels to recognized objects; on the other, text-modality input broadens the describable recognition targets and the structure is highly extensible. The two general target detection models provide sufficiently generalized basic features for subsequent processing, improving the overall generalization capability of the detection model.
2. By building a small labeled-sample library of remote sensing images and preprocessing the image features of the labeled samples as feature samples, the invention remedies the SAM model's shortcomings in remote sensing applications; the added subdivision sample library improves the recognition accuracy of the SAM base model on remote sensing imagery, and semantic tags are added for recognized objects. Since the introduced subdivision-field samples require no fine-tuning, users can introduce new samples in real time, preserving the model's generalization capability while improving its accuracy.
3. The small-sample adaptive detection structure designed by the invention improves the SAM model's segmentation results without any model training, combining Grounding DINO's text-image conversion capability with custom samples; the output also carries a semantic tag for each recognized object, which markedly improves the practicality of general target detection models in applied fields.
4. The proposed model structure can be used in subdivision fields such as remote sensing imagery, improves the generalized segmentation capability of the SAM model, conveniently carries subdivision-field sample knowledge, and applies equally to other fields; it is highly flexible and extensible and can significantly improve the deployed effectiveness of general target detection models in specific domains.
Drawings
FIG. 1 is a schematic diagram of a small sample learning remote sensing image detection method based on a SAM model;
FIG. 2 is a user-added custom sample artwork according to the present invention.
FIG. 3 is a schematic diagram of a user-added custom sample according to the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar symbols indicate like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
As shown in fig. 1-3, the invention discloses a SAM-model-based method for small-sample learning remote sensing image detection, comprising the following steps:
S1, establishing an object sample library and extracting the sample labeling and semantic-tag features of each sample;
S2, the user inputs a remote sensing image and a detection-target text description to the detection model; the detection model preprocesses them with the Grounding DINO model to obtain the initial labeling result and initial semantic label of the detection target, and preprocesses the remote sensing image with the SAM model to obtain image features;
S3, the detection model uses the initial labeling result and initial semantic label of the detection target, together with the sample labelings and semantic labels, to extract comprehensive feature points, and superposes the image features obtained by the SAM model as a further prompt signal for SAM target detection, so that the extraction result is obtained under the guidance of the sample library.
The following describes key terms and English abbreviations used in the invention:
SAM model: the Segment Anything Model, a general model for image segmentation tasks. Compared with traditional segmentation models, the SAM model can extract the boundaries of image content from various input prompts and achieves better segmentation on scenes that are unseen by the model or relatively blurred.
Grounding DINO: a text-image cross-modal open-set target detection model that can generate corresponding image features and candidate detection boxes from arbitrary text.
The text-remote-sensing-image cross-modal detection model takes a general target detection model as its base and introduces a remote sensing sample library in plug-and-play fashion; the overall model structure is shown in fig. 1.
As shown in part B of fig. 1, a sample library of custom image samples with sample labelings and semantic tags, Sample_mask[1..N], may be predefined or uploaded by the user at run time. Each sample contains an original image I_R, a sample labeling M_R, and a semantic tag Label_R; sample-library feature extraction refers to extracting the sample labeling M_R and semantic tag Label_R features based on the original image I_R.
Feature extraction over the custom sample library Sample_mask[1..N] uses the SAM model's image encoder Φ_i-enc:

F_R^(r) = Φ_i-enc(I_R^(r)), r ∈ [1, N] (13)
For each pair of sample image and labeling in the sample library, the positional confidence distribution is computed: the sample labeling M_R is spatially multiplied with the sample image feature F_R to obtain the distribution similarity F̄_R; the cosine similarity between F̄_R and the image feature F then yields the probability S that each pixel of the input remote sensing image matches the target labeling.
The probability S can be computed over the image's multiple channels:

F̄_R = M_R ⊙ F_R (10)

S_i = cosine(F̄_R, F_i) (11)

S = (1/n) Σ_{i=1}^{n} S_i (12)

where i ∈ [1, n] indexes the image channels participating in the calculation and S is the mean result over the channels.
From the calculated probability S, the two points with the highest and lowest confidence, P_high and P_low, are found; they represent the most likely target point and the least likely (i.e., most likely background) point, respectively. The confidence at P_high can be required to exceed a specified threshold of 0.9; the two points P_high and P_low then serve as positive and negative prompt points that guide the SAM model to decide whether the detection target is present in the image. If the confidence at P_high is below the 0.9 threshold, the sample's target may be assumed absent from the input remote sensing image.
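The presence test just described amounts to a single threshold comparison; a minimal sketch follows, where the 0.9 value comes from the text and the function name is ours.

```python
def target_present(p_high_conf, threshold=0.9):
    """If the best confidence is below the threshold, assume the sample's
    target is absent from the input image (0.9 comes from the text)."""
    return p_high_conf >= threshold

print(target_present(0.95), target_present(0.42))  # → True False
```

In practice the threshold would likely be tuned per sample library rather than fixed.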
The two points P_high and P_low, as positive and negative prompt points of the SAM model, are input as point-type prompts to the prompt encoder Φ_p-enc to extract coding features:

T_prompt = Φ_p-enc(P_high, P_low) (5)

where Φ_p-enc is the prompt encoder of the SAM model and T_prompt is the comprehensive feature point, used in the next step to guide the SAM model to extract the final key result.
In use, as shown in part A of fig. 1, the user inputs a remote sensing image and a detection-target text description to the detection model, which preprocesses them with the Grounding DINO model to obtain the initial labeling result M_dino and initial semantic label Label_dino of the detection target.
Grounding DINO is an open-set target detector supporting text-image multimodal input. For a user's input pair of remote sensing image (image, I) and detection-target text description (text), it outputs multiple candidate regions, each carrying an initial labeling result M_dino^(i) and a corresponding semantic tag Label_dino^(i). Grounding DINO comprises three main components, an image encoder, a text encoder, and a decoder, and the process can be expressed as:

F_img-dino = Φ_dino-img-enc(I) (1)

F_text = Φ_dino-text-enc(text) (2)

(M_dino, Label_dino) = Φ_dino-dec(F_img-dino, F_text) (3)

where I is the remote sensing image input by the user, text is the detection-target text description, Φ_dino-img-enc is the image encoder of Grounding DINO with output F_img-dino, Φ_dino-text-enc is the text encoder with output F_text, Φ_dino-dec is the decoder, M_dino is the initial labeling result of the detection target, and Label_dino is its initial semantic label.
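The data flow of eqs. (1) to (3) can be sketched structurally as below. All three Grounding DINO components are stubs that only mimic the shapes of the interfaces; none of this is the real model or its API.

```python
import numpy as np

def dino_img_enc(image):                  # Φ_dino-img-enc (stub)
    return image.mean(axis=-1)            # toy "image feature" F_img-dino

def dino_text_enc(text):                  # Φ_dino-text-enc (stub)
    return np.array([float(len(w)) for w in text.split()])  # toy F_text

def dino_dec(f_img, f_text):              # Φ_dino-dec (stub)
    # a real decoder would cross-attend f_img and f_text; here we just
    # emit one fixed candidate box M_dino and label Label_dino
    return [(0, 0, 4, 4)], ["target"]

image = np.zeros((8, 8, 3))
m_dino, label_dino = dino_dec(dino_img_enc(image), dino_text_enc("football field"))
print(m_dino, label_dino)
```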
The image features are obtained by preprocessing the input remote sensing image with the SAM model, an interactive segmentation framework that generates segmentation results from a given prompt (e.g., foreground/background points, a region bounding box, or a mask) and comprises three main components: an image encoder, a prompt encoder, and a mask decoder.
The SAM model preprocesses the input remote sensing image as:

F_img = Φ_i-enc(I) (4)

where Φ_i-enc is the SAM model's pre-trained image encoder, I is the user-input remote sensing image, and F_img is the image feature. The SAM model processes the image into the intermediate image feature F_img using a Vision Transformer (ViT) based pre-trained image encoder Φ_i-enc.
Through this Grounding DINO and SAM preprocessing of the input remote sensing image I, the Grounding DINO candidate-region labelings, namely M_dino and Label_dino, are obtained, and the image feature F_img is obtained from the SAM model.
Further, as shown in part C of fig. 1, the preprocessed M_dino and Label_dino are combined with the sample features extracted from the sample library for comprehensive feature extraction, and the combined features are superposed as a further prompt signal for SAM target detection, finally yielding the extraction result under the guidance of the sample library, including the region of each recognized target object and its semantic label.
Specifically, given the multiple candidate regions M_dino obtained by the Grounding DINO model, the correlation between each candidate region's semantic tag Label_dino and the sample tag Label_R is computed, for which Chinese word-vector correlation can be used:

Label_M = argmax sim(Label_dino, Label_R) (6)

where sim is the Chinese word-vector correlation calculation.
According to Label_M, the M_dino of the most relevant candidate rectangular region above the acquisition threshold is extracted by the SAM model's prompt encoder to obtain T_dino:

T_dino = Φ_p-enc(M_dino) (7)
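The label matching of eq. (6) can be sketched with cosine similarity over word vectors; the embedding here is a toy lookup table standing in for a real Chinese word-vector model, and all names and vectors are invented for illustration.

```python
import numpy as np

def most_relevant_label(label_dino, library, embed):
    """Eq. (6): pick the library label whose word vector is most
    similar to the Grounding DINO label (sim = cosine similarity)."""
    q = embed(label_dino)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(library, key=lambda lab: cos(q, embed(lab)))

vecs = {"football field": np.array([1.0, 0.1]),   # toy embeddings
        "soccer pitch":   np.array([0.9, 0.2]),
        "storage tank":   np.array([0.0, 1.0])}
best = most_relevant_label("football field",
                           ["soccer pitch", "storage tank"],
                           lambda w: vecs[w])
print(best)  # → soccer pitch
```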
T dino is overlapped with T prompt after the point prompt information is encoded, and the point prompt information is jointly used as a prompt of a SAM model to be decoded by a decoder phi m-dec of the SAM model, so that the positioning accuracy of the SAM model detection can be further improved, and the process formula is as follows:
(8)
repeating the steps for N times, respectively calculating for all the provided N custom samples, and finally identifying all the detection target areas which possibly exist:
$M_{final} = \bigcup_{r=1}^{N} \Phi_{m\text{-}dec}\big(F_{img},\ T_{dino}^{(r)} \oplus T_{prompt}^{(r)}\big)$ (9)
wherein r ∈ [1, N], N is a positive integer equal to the number of custom samples, $T_{dino}^{(r)}$ is the value of T dino in the r-th iteration, and $T_{prompt}^{(r)}$ is the value of T prompt in the r-th iteration.
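The per-sample repetition of formula (9) can be sketched as a loop whose per-iteration masks are unioned. The detection function is a placeholder for one full pass of formula (8).

```python
import numpy as np

def detect_one_sample(f_img, r):
    """Placeholder for one pass of eq. (8) using the r-th sample's prompts."""
    mask = np.zeros(f_img.shape, dtype=bool)
    mask[r] = True  # pretend sample r detects its own row of the feature map
    return mask

f_img = np.zeros((8, 8))
n = 3  # number of custom samples N
per_sample = [detect_one_sample(f_img, r) for r in range(n)]
final = np.logical_or.reduce(per_sample)  # union over r = 1..N, eq. (9)
```

Because the N passes share F img and differ only in prompts, the loop is embarrassingly parallel in a real implementation.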
An application of the embodiment is described below:
referring to fig. 1, the user inputs the remote sensing image to be detected as the input remote sensing image, together with the detection-target text description "extract sports fields such as football fields and basketball fields in the image".
The detection model first performs extraction with Grounding Dino to obtain M dino, which comprises the initial labeling results of 3 candidate regions and their semantic tags: "football field", "basketball field" and "sports field".
The existing sample library contains 10 related detection targets, with semantic tags "football field", "basketball field", "plastic track", "building", "tree", "road", "storage tank", "airplane", "automobile" and "ship".
Through calculation, the semantic tags of the 3 candidate regions M dino are found to be most similar to the tags of the 3 target samples "football field", "basketball field" and "plastic track" in the sample library, so the maximum-probability feature points and background points corresponding to these 3 samples in the input picture are extracted respectively.
Then the 3 candidate regions in M dino are each combined with the corresponding feature points and background-point features (3 combinations in total) and used as prompt-information input to the SAM model, finally obtaining 3 target detection results under the guidance of the sample library, namely the "football field", "basketball field" and "plastic track" detection areas marked with wire frames in part C of fig. 1.
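Replaying this worked example as code: each Grounding Dino candidate label is matched to the closest sample-library tag. The similarity scores below are hypothetical numbers chosen to reproduce the matching described, not values from the patent.

```python
# Hypothetical similarity scores between candidate labels and library tags.
SIM = {
    "football field":   {"football field": 1.0, "basketball field": 0.3, "plastic track": 0.4},
    "basketball field": {"football field": 0.3, "basketball field": 1.0, "plastic track": 0.35},
    "sports field":     {"football field": 0.6, "basketball field": 0.55, "plastic track": 0.7},
}
LIBRARY = ["football field", "basketball field", "plastic track",
           "building", "tree", "road", "storage tank", "airplane", "automobile", "ship"]

def best_tag(candidate):
    """Pick the library tag most similar to a Grounding Dino candidate label."""
    scores = SIM[candidate]
    return max((t for t in LIBRARY if t in scores), key=scores.get)

matches = [best_tag(c) for c in ["football field", "basketball field", "sports field"]]
```

Note how the generic candidate "sports field" resolves to the more specific library tag "plastic track", which is the behavior the example attributes to sample-library guidance.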
The establishment and update of the sample library can include user-defined sample addition. As shown in fig. 2, when a user finds that the existing recognition types in the sample library cannot meet the detection requirement and wants to add a "roundabout intersection" recognition target, the user only needs to upload an image annotated with that detection target to expand the sample library; the detection model can then mark whether a similar detection target exists in a newly input remote sensing image, as shown in fig. 3.
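A minimal sketch of this user-defined sample addition: the library gains one entry per uploaded annotated image, with no model retraining. The field names and values are illustrative assumptions, not the patent's data schema.

```python
# Sample library keyed by semantic tag; each entry holds the user's annotation
# mask and the features extracted from it (toy values here).
sample_library = {
    "football field": {"mask": [[1, 1], [1, 0]], "features": [0.9, 0.1]},
}

def add_custom_sample(library, tag, mask, features):
    """Register a user-uploaded annotated sample so it guides future detections."""
    library[tag] = {"mask": mask, "features": features}

add_custom_sample(sample_library, "roundabout intersection",
                  mask=[[0, 1], [1, 0]], features=[0.2, 0.8])
```

Because detection is prompt-driven, extending recognition to a new class is a dictionary insert rather than a training run, which is the low-resource property the patent emphasizes.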
The invention discloses a method for realizing a text-customized image target detection model in the field of remote sensing images, which combines a general image target detection base model with new samples from the remote sensing field. A user can designate a detection target in any remote sensing image through natural language, and the model outputs the recognized object regions in the remote sensing image together with semantic labels. The general image target detection base model comprises the text-image feature conversion model Grounding Dino and the general target detection model SAM: the conversion features from the text modality to the image modality are obtained using Grounding Dino, preliminary candidate regions are obtained using the generalized segmentation capability of the SAM model, and a small amount of labeled sample knowledge of remote sensing images is preprocessed into feature samples that are combined with the SAM model's capability to recognize everything. On the one hand, this overcomes the shortcomings of applying the SAM model in the field of remote sensing geographic imagery: the added sample sub-library improves the recognition accuracy of the SAM base model on remote sensing images and attaches semantic labels to recognized objects. On the other hand, the text-modality input of the invention expands the descriptive capability for recognition objects and has strong structural extensibility, allowing a user to customize new samples and set special detection targets, realizing rapid adaptation to newly added recognition targets and expanding the application scenarios of the SAM model.
Meanwhile, in order to fully utilize the segment-everything recognition capability of the SAM model to improve the generalization of remote sensing image target detection, and to support users freely describing detection requirements in natural language, the small-sample adaptive model structure designed by the invention improves the segmentation results of the SAM model without any model retraining, combining the text-image conversion capability of Grounding Dino with user-defined samples. The output of the scheme also attaches a semantic label to each recognized object. The proposed model structure can utilize and improve the generalized segmentation capability of the SAM model in subdivided fields such as remote sensing imagery, better supports diversified user detection requirements, better fuses remote sensing sample knowledge, and achieves improved detection effects at low resource cost, with fast adaptive response and stronger generalization, significantly improving the intelligent experience of interactive remote sensing image query.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art according to the technical scheme of the present invention and its inventive concept, within the scope disclosed by the present invention, shall be covered by the protection scope of the present invention.
It is to be understood that the terms "center," "longitudinal," "transverse," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counter-clockwise," "axial," "radial," "circumferential," and the like are directional or positional relationships as indicated based on the drawings, merely to facilitate describing the invention and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be configured and operated in a particular orientation, and therefore should not be construed as limiting the invention.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally formed; may be mechanically connected, may be electrically connected or may be in communication with each other; either directly or indirectly, through intermediaries, or both, may be in communication with each other or in interaction with each other, unless expressly defined otherwise. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
In the present invention, unless expressly stated or limited otherwise, a first feature "up" or "down" a second feature may be the first and second features in direct contact, or the first and second features in indirect contact via an intervening medium. Moreover, a first feature being "above," "over" and "on" a second feature may be a first feature being directly above or obliquely above the second feature, or simply indicating that the first feature is level higher than the second feature. The first feature being "under", "below" and "beneath" the second feature may be the first feature being directly under or obliquely below the second feature, or simply indicating that the first feature is less level than the second feature.
Claims (5)
1. A SAM-model-based small-sample learning remote sensing image detection method, characterized by comprising the following steps:
s1, establishing an object sample library, and extracting sample labels and semantic tag features of each sample;
S2, inputting a remote sensing image and text description of a detection target to a detection model by a user, and preprocessing the remote sensing image and the text description by the detection model by using Grounding Dino model to obtain an initial labeling result and an initial semantic label of the detection target; preprocessing a remote sensing image by using the SAM model to obtain image characteristics;
s3, the detection model utilizes an initial labeling result and an initial semantic label of a detection target to be compared with a sample labeling and semantic label of a sample to extract comprehensive feature points, and then image features acquired by the SAM model are overlapped to serve as prompt signals for further target detection of the SAM model, so that the detection model acquires an extraction result under the guidance of a sample library;
in the step S3, the step of extracting the comprehensive feature points is as follows:
S3A1, performing spatial multiplication of the sample label M R and the sample image feature F R to obtain their distribution similarity $\bar{F}_R$; wherein the sample image feature F R is acquired by the image encoder Φ i-enc of the SAM model;
S3A2, performing a cosine similarity calculation between the distribution similarity $\bar{F}_R$ of the sample image and the image feature F img to obtain the probability S that each pixel of the remote sensing image conforms to the target mark;
S3A3, obtaining the highest confidence coefficient P high and the lowest confidence coefficient P low of each pixel of the input remote sensing image according to the probability S;
S3A4, extracting the comprehensive feature points by using the prompt encoder Φ p-enc of the SAM model according to the highest confidence coefficient P high and the lowest confidence coefficient P low, with the formula:
$T_{prompt} = \Phi_{p\text{-}enc}(P_{high},\ P_{low})$ (5)
wherein Φ p-enc is the prompt encoder of the SAM model, and T prompt is the comprehensive feature point;
and in the step S3, the steps of superposing image features as further prompt signals for SAM model target detection are as follows:
S3B1, in combination with the detection-target initial labeling results M dino obtained by the Grounding Dino model, obtaining the Lable M with the highest degree of correlation according to the correlation between the initial semantic tag Lable dino of each initial labeling result M dino and Lable R; the process formula for obtaining Lable M is:
$\mathrm{Lable}_M = \arg\max_{R}\ \mathrm{sim}(\mathrm{Lable}_{dino},\ \mathrm{Lable}_R)$ (6)
wherein sim is the Chinese word-vector correlation calculation, and Lable R is the semantic tag of a sample in the sample library;
S3B2, M dino is encoded by the prompt encoder of the SAM model to obtain T dino, with the process formula:
$T_{dino} = \Phi_{p\text{-}enc}(M_{dino})$ (7)
S3B3, T dino and the comprehensive feature point T prompt are superposed and jointly input as the prompt of the SAM model into the decoder Φ m-dec for decoding, improving the detection accuracy of the SAM model, with the process formula:
$M_r = \Phi_{m\text{-}dec}\big(F_{img},\ T_{dino} \oplus T_{prompt} \oplus T_{out}\big)$ (8)
wherein T out is a fixed prompt hyperparameter of the SAM model used for boundary extraction;
S3B4, repeating the steps S3B1 to S3B3 N times to obtain the final output of the detection model, with the process formula:
$M_{final} = \bigcup_{r=1}^{N} \Phi_{m\text{-}dec}\big(F_{img},\ T_{dino}^{(r)} \oplus T_{prompt}^{(r)} \oplus T_{out}\big)$ (9)
wherein r ∈ [1, N], N is a positive integer equal to the number of custom samples, $T_{dino}^{(r)}$ is the value of T dino in the r-th iteration, and $T_{prompt}^{(r)}$ is the value of T prompt in the r-th iteration.
2. The SAM model-based small sample learning remote sensing image detection method of claim 1, wherein: in the step S2, the Grounding Dino model is utilized to preprocess the input remote sensing image and the text description of the detection target, and the process formula is as follows:
$F_{img\text{-}dino} = \Phi_{dino\text{-}img\text{-}enc}(I)$ (1)
$F_{text} = \Phi_{dino\text{-}text\text{-}enc}(Text)$ (2)
$(M_{dino},\ \mathrm{Lable}_{dino}) = \Phi_{dino\text{-}dec}(F_{img\text{-}dino},\ F_{text})$ (3)
Wherein I is a remote sensing image input by a user, text is a detection target text description, Φ dino-img-enc represents an image encoder of Grounding Dino, F img-dino is an output quantity of Φ dino-img-enc, Φ dino-text-enc represents a text encoder of Grounding Dino, F text is an output quantity of Φ dino-text-enc, Φ dino-dec represents a decoder of Grounding Dino, M dino is a detection target initial labeling result, and Lable dino is a detection target initial semantic tag.
3. The SAM model-based small sample learning remote sensing image detection method of claim 2, wherein: the SAM model in S2 preprocesses the input remote sensing image, and the process formula is as follows:
$F_{img} = \Phi_{i\text{-}enc}(I)$ (4)
wherein Φ i-enc is the pre-trained image encoder of the SAM model, I is the user-input remote sensing image, and F img is the image feature.
4. The SAM model-based small sample learning remote sensing image detection method of claim 3, wherein: the probability S is obtained by calculating a plurality of channels of the image, and the process formula is as follows:
$\bar{F}_R^{\,i} = M_R \cdot F_R^{\,i}$ (10)
$S_i = \cos\!\big(\bar{F}_R^{\,i},\ F_{img}^{\,i}\big)$ (11)
$S = \dfrac{1}{n}\sum_{i=1}^{n} S_i$ (12)
wherein i ∈ [1, n] indexes the channels of the image, $\bar{F}_R$ is the distribution similarity, M R is the sample label, F R is the sample image feature, S i is the result value of the i-th channel, and S is the mean result over the channels.
5. The SAM model-based small sample learning remote sensing image detection method of claim 1, wherein: and in the S1, an object sample library mode is established, wherein the mode comprises user-defined sample adding.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410154598.3A CN117690031B (en) | 2024-02-04 | 2024-02-04 | SAM model-based small sample learning remote sensing image detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117690031A CN117690031A (en) | 2024-03-12 |
CN117690031B true CN117690031B (en) | 2024-04-26 |
Family
ID=90130410
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117690031B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118097289B (en) * | 2024-03-15 | 2024-09-20 | 华南理工大学 | Open world target detection method based on visual large model enhancement |
CN117952993B (en) * | 2024-03-27 | 2024-06-18 | 中国海洋大学 | Semi-supervised medical image segmentation method based on image text cooperative constraint |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023077816A1 (en) * | 2021-11-03 | 2023-05-11 | 中国华能集团清洁能源技术研究院有限公司 | Boundary-optimized remote sensing image semantic segmentation method and apparatus, and device and medium |
CN116775922A (en) * | 2023-05-16 | 2023-09-19 | 中国航空综合技术研究所 | Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics |
CN116824307A (en) * | 2023-08-29 | 2023-09-29 | 深圳市万物云科技有限公司 | Image labeling method and device based on SAM model and related medium |
CN116912663A (en) * | 2023-07-20 | 2023-10-20 | 同济大学 | Text-image detection method based on multi-granularity decoder |
CN116994140A (en) * | 2023-08-14 | 2023-11-03 | 航天宏图信息技术股份有限公司 | Cultivated land extraction method, device, equipment and medium based on remote sensing image |
CN117197609A (en) * | 2023-08-25 | 2023-12-08 | 中国自然资源航空物探遥感中心 | Construction method, system, medium and equipment of remote sensing sample data set |
CN117275025A (en) * | 2023-11-01 | 2023-12-22 | 北京道仪数慧科技有限公司 | Processing system for batch image annotation |
Non-Patent Citations (2)
Title |
---|
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection; Shilong Liu, et al.; arXiv:2303.05499v4 [cs.CV]; 2023-03-20; full text *
A survey of deep-learning-based natural scene text detection and recognition; Wang Jianxin et al.; Journal of Software; 2020-05-15 (05); full text *
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||