CN117690031B - SAM model-based small sample learning remote sensing image detection method - Google Patents
- Publication number: CN117690031B
- Application number: CN202410154598.3A
- Authority: CN (China)
- Legal status: Active (an assumption by Google Patents, not a legal conclusion; no legal analysis has been performed)
Classifications
- G06V20/10 — Scenes; scene-specific elements: terrestrial scenes
- G06V10/26 — Image preprocessing: segmentation of patterns in the image field
- G06V10/764 — Recognition using pattern recognition or machine learning: classification
- G06V10/774 — Processing image features in feature spaces: generating sets of training patterns
- Y02D10/00 — Climate change mitigation in ICT: energy efficient computing
Abstract
The invention discloses a SAM-model-based small-sample learning method for remote sensing image detection, comprising the following steps: S1, establishing an object sample library; S2, obtaining an initial labeling result and an initial semantic label of the detection target with the Grounding DINO model, and acquiring image features with the SAM model; S3, extracting comprehensive feature points, superposing the image features, and letting the detection model obtain the extraction result under the guidance of the sample library. The invention uses the SAM and Grounding DINO models to preprocess a small amount of labeled-sample knowledge from remote sensing images as feature samples and combines these with the SAM model's segment-anything recognition capability, giving strong structural extensibility. A preliminary candidate region is obtained from the SAM model's generalized segmentation capability, the two general target detection models provide sufficiently generalized basic features for subsequent processing, and the generalization capability of the detection model is improved.
Description
Technical Field
The invention relates to the technical field of cross-modal detection of remote sensing images, and in particular to a SAM-model-based method for small-sample learning remote sensing image detection.
Background
Remote sensing image target detection is an image detection task for a specific application scene. Most existing remote sensing target detection models are based on backbones such as ResNet, Vision Transformer (ViT), or Swin Transformer and, after extensive supervised training on remote sensing samples, form a relatively fixed recognition model whose detection effect is limited by the number and labeling quality of those samples. Because of complex background interference and the ambiguity and diversity of object edges in remote sensing scenes, long-tail cases are common, large amounts of labeled data are needed, and training costs are high. The generalization ability of traditional models is also mediocre: a supervised model can only detect the target types present in its training samples, and recognizing a newly added target type requires labeling new samples and retraining, a long iteration process; moreover, no semantic association is established between different recognized types.
General target detection models, represented by the SAM model (Segment Anything Model), are expected to alleviate the long-tail problem of visual target detection and reduce the development cost of customized recognition models. However, current general target detection models still have the following shortcomings. First, the SAM model is not a fully unsupervised detection model; it relies on manually entered prompt information, such as point, box, or region prompts. Second, segmentation of new samples in a new field lacks the guidance of domain image knowledge, so the segmentation granularity is too fine or too coarse and the complete boundary of a domain object cannot be accurately delineated. Third, the SAM model's segmentation result carries no semantic tag and cannot indicate what the segmented image object represents. To improve remote sensing detection using the generalized recognition capability of a general target detection model, sample knowledge from the remote sensing subdivision field must be incorporated appropriately.
Existing approaches to improving the generalization of remote sensing target detection based on the general target detection SAM model include the following:
One is to directly train generic text-image cross-modal target detection. Models such as CLIP and Grounding DINO are trained directly on text-image multimodal data with the goal of open-set detection: any detection target can be specified by the input text. The diversity of text descriptions gives such models strong input generalization, but as general cross-modal detectors their strength lies in text-to-image modality conversion; the narrower capability of target detection is not prominent enough, and detection accuracy on remote sensing images is unsatisfactory, sometimes not even reaching the level of traditional models.
Secondly, connecting the image features extracted by the SAM model encoder to a remote sensing image target detection and classification module. This scheme mainly exploits the strong edge extraction capability of the SAM model, then uses remote sensing image samples to train an independent target discrimination and label classification module for the remote sensing field, combining the SAM model's extraction results with remote sensing sample knowledge.
Thirdly, having a remote sensing image target detection module extract candidate bounding boxes and feed them to the SAM model as prompt information. This scheme starts from the input side of the SAM model: a training module for remote sensing samples is placed in front, and a traditional target detection model runs first.
These schemes improve, to some extent, the detection effect of a general target detection model applied to remote sensing imagery, but obvious defects remain. In the first scheme, although any recognition target can be specified through the text modality and input generalization is sufficient, remote sensing image knowledge cannot be introduced effectively and recognition precision is poor. The second and third schemes each introduce supervised training on remote sensing samples through a separate module, which improves recognition accuracy but leaves input generalization insufficient; combining the SAM model with a supervised model introduces the knowledge contained in labeled remote sensing samples but restricts the recognizable target types. Such a model can still only detect the target types present in its training samples; recognizing a new target type requires retraining the supervised module, so the generalization capability of the general target detection model cannot be fully exploited.
These two points are the main defects of existing schemes applying general target detection models to remote sensing imagery, and they limit the accuracy and generalization of current remote sensing target detection.
Patent document CN116824133A discloses an intelligent interpretation method for remote sensing images in the field of remote sensing image processing, comprising: an encoder module, a vector-quantized variational autoencoder used to learn hidden-layer feature representations; training the encoder module with a first training set and jointly training a central-area module and a decoder module with the first and second training sets; and acquiring a remote sensing image to be segmented and feeding it to the trained semantic segmentation model to obtain a semantic segmentation result. Compared with the original SAM model, this supervised machine-learning detection model improves accuracy on remote sensing images to some degree. But it simultaneously suffers from insufficient generalization: the target types of the supervised model are relatively fixed, and the effect degrades when the test image differs greatly from the training samples.
Disclosure of Invention
The invention aims to provide a SAM-model-based small-sample learning method for remote sensing image detection that improves both the generalization and the accuracy of existing remote sensing target detection.
The aim of the invention is achieved by the following technical scheme: a SAM-model-based method for small-sample learning remote sensing image detection, comprising the following steps:
S1, establishing an object sample library and extracting the sample labeling and semantic-tag features of each sample;
S2, the user inputs a remote sensing image and a text description of the detection target to the detection model; the detection model preprocesses them with the Grounding DINO model to obtain the initial labeling result and initial semantic label of the detection target, and preprocesses the remote sensing image with the SAM model to obtain image features;
S3, the detection model compares the initial labeling result and initial semantic label of the detection target with the sample labelings and semantic labels in the sample library, extracts comprehensive feature points, and then superposes the image features obtained by the SAM model as a further prompt signal for SAM target detection, so that the detection model obtains the extraction result under the guidance of the sample library.
Further: in step S2, the Grounding DINO model preprocesses the input remote sensing image and the detection-target text description as follows:

F_img-dino = Φ_dino-img-enc(I) (1)

F_text = Φ_dino-text-enc(text) (2)

(M_dino, Label_dino) = Φ_dino-dec(F_img-dino, F_text) (3)

where I is the remote sensing image input by the user, text is the detection-target text description, Φ_dino-img-enc is the image encoder of Grounding DINO with output F_img-dino, Φ_dino-text-enc is the text encoder of Grounding DINO with output F_text, Φ_dino-dec is the decoder of Grounding DINO, M_dino is the initial labeling result of the detection target, and Label_dino is its initial semantic label.
Further: in step S2 the SAM model preprocesses the input remote sensing image as follows:

F_img = Φ_i-enc(I) (4)

where Φ_i-enc is the SAM model's pre-trained image encoder, I is the user-input remote sensing image, and F_img is the resulting image feature.
Further: in step S3, the comprehensive feature points are extracted as follows:
S3A1, spatially multiply the sample labeling M_R with the sample image feature F_R to obtain their distribution similarity F̄_R;
S3A2, compute the cosine similarity between the distribution similarity F̄_R and the image feature F to obtain the probability S that each pixel of the remote sensing image matches the target labeling;
S3A3, from the probability S, obtain the highest-confidence point P_high and the lowest-confidence point P_low of the input remote sensing image;
S3A4, extract the comprehensive feature points from P_high and P_low with the SAM model's prompt encoder Φ_p-enc:

T_prompt = Φ_p-enc(P_high, P_low) (5)

where Φ_p-enc is the prompt encoder of the SAM model and T_prompt is the comprehensive feature point.
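As a concrete illustration of step S3A3, the sketch below picks the highest- and lowest-confidence pixels from a toy probability map S with numpy; the function name and the toy values are illustrative and are not part of the patent.

```python
import numpy as np

def extract_prompt_points(S):
    """Step S3A3: P_high is the likeliest target pixel in probability
    map S, P_low the likeliest background pixel (function name ours)."""
    p_high = np.unravel_index(np.argmax(S), S.shape)  # positive prompt point
    p_low = np.unravel_index(np.argmin(S), S.shape)   # negative prompt point
    return p_high, p_low

S = np.array([[0.10, 0.20, 0.10, 0.00],
              [0.20, 0.95, 0.30, 0.10],
              [0.10, 0.30, 0.20, 0.10],
              [0.00, 0.10, 0.10, 0.05]])  # toy probability map
p_high, p_low = extract_prompt_points(S)
print(p_high, p_low)
```

In the patent these two pixel coordinates would then be passed to Φ_p-enc as positive and negative point prompts.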
Further: in step S3, the image features are superposed as a further prompt signal for SAM target detection as follows:
S3B1, from the initial labeling results M_dino obtained by the Grounding DINO model, compute the correlation between each result's initial semantic label Label_dino and the sample label Label_R, and take the M_dino whose label Label_M has the highest correlation:

Label_M = argmax sim(Label_dino, Label_R) (6)

where sim is the Chinese word-vector correlation calculation;
S3B2, extract T_dino from M_dino with the SAM model's prompt encoder:

T_dino = Φ_p-enc(M_dino) (7)

S3B3, superpose T_dino and the comprehensive feature point T_prompt and input them together, as the SAM model's prompt, into the decoder Φ_m-dec for decoding, improving the detection accuracy of the SAM model:

M_result = Φ_m-dec(F_img, T_dino, T_prompt, T_out) (8)

where T_out is a fixed prompt hyperparameter of the SAM model for boundary extraction;
S3B4, repeat steps S3B1 to S3B3 N times to obtain the final output of the detection model:

Output = ∪_{r=1}^{N} Φ_m-dec(F_img, T_dino^(r), T_prompt^(r), T_out) (9)

where r ∈ [1, N], N is a positive integer equal to the number of custom samples, T_dino^(r) is the value of T_dino at the r-th iteration, and T_prompt^(r) is the value of T_prompt at the r-th iteration.
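The iteration in steps S3B1 to S3B4 can be sketched as follows. Φ_p-enc and Φ_m-dec are replaced by trivial stand-ins, since the real SAM prompt encoder and mask decoder are not reproduced here; only the per-sample loop and the prompt superposition structure follow the text.

```python
import numpy as np

def phi_p_enc(prompt):                        # stand-in for Φ_p-enc
    return np.asarray(prompt, dtype=float)

def phi_m_dec(f_img, tokens):                 # stand-in for Φ_m-dec
    return f_img.mean() + sum(t.sum() for t in tokens)

def detect(f_img, samples, t_out=np.zeros(1)):
    """One decode per custom sample, results collected as in eq. (9)."""
    results = []
    for m_dino, t_prompt in samples:          # r = 1..N
        t_dino = phi_p_enc(m_dino)            # eq. (7)
        # superpose T_dino, T_prompt and the fixed T_out token (eq. (8))
        results.append(phi_m_dec(f_img, [t_dino, t_prompt, t_out]))
    return results

f_img = np.ones((4, 4))
samples = [([1.0, 2.0], np.array([0.5])), ([0.0, 1.0], np.array([1.5]))]
out = detect(f_img, samples)
print(len(out))
```

With real SAM components, each loop iteration would produce one candidate mask, and the N results would be merged into the final output.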
Further: the probability S is computed over the image's multiple channels as follows:

F̄_R = M_R ⊙ F_R (10)

S_i = cosine(F̄_R, F_i) (11)

S = (1/n) Σ_{i=1}^{n} S_i (12)

where i ∈ [1, n] indexes the image channels, F̄_R is the distribution similarity, M_R is the sample labeling, F_R is the sample image feature, F_i is the i-th channel of the image feature, S_i is the result value of the i-th channel, and S is the mean result over the channels.
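A minimal numpy reading of eqs. (10) to (12). Note that the text describes S both as a per-pixel probability and as a per-channel mean; this sketch follows the equations literally, producing one averaged score per sample. The toy mask and feature arrays are invented for illustration.

```python
import numpy as np

def target_probability(M_R, F_R, F):
    """Literal reading of eqs. (10)-(12): spatial product of sample mask
    and sample features, per-channel cosine against the input features,
    then the mean over the n channels."""
    n = F.shape[0]
    F_bar = M_R * F_R                      # eq. (10): distribution similarity
    S_i = np.empty(n)
    for i in range(n):                     # eq. (11): cosine per channel
        a, b = F_bar[i].ravel(), F[i].ravel()
        S_i[i] = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return S_i.mean()                      # eq. (12): mean over channels

rng = np.random.default_rng(0)
M_R = np.zeros((1, 4, 4)); M_R[0, 1:3, 1:3] = 1.0  # toy sample labeling
F_R = rng.random((3, 4, 4))                         # toy sample features
S = target_probability(M_R, F_R, F_R.copy())        # image features = F_R here
print(0.0 < S <= 1.0)  # → True
```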
Further: the object sample library in S1 is established in a manner that includes user-defined sample addition.
The invention has the following beneficial effects:
1. Through the combined use of the general target detection SAM model and the text-image feature conversion model Grounding DINO, the invention obtains text-to-image conversion features from Grounding DINO and a preliminary candidate region from the SAM model's generalized segmentation capability, while preprocessing a small amount of labeled-sample knowledge from remote sensing images as feature samples to combine with the SAM model's segment-anything recognition ability. On one hand this remedies the shortcomings of applying the SAM model in the remote sensing and geographic imaging field and adds semantic labels to recognized objects; on the other, text-modality input broadens the describable recognition targets and the structure is highly extensible. The two general target detection models provide sufficiently generalized basic features for subsequent processing, improving the overall generalization capability of the detection model.
2. By building a small labeled-sample library of remote sensing images and preprocessing the image features of the labeled samples as feature samples, the invention remedies the SAM model's shortcomings in remote sensing applications; the added subdivision sample library improves the recognition accuracy of the SAM base model on remote sensing imagery, and semantic tags are added for recognized objects. Since the introduced subdivision-field samples require no fine-tuning, users can introduce new samples in real time, preserving the model's generalization capability while improving its accuracy.
3. The small-sample adaptive detection structure designed by the invention improves the SAM model's segmentation results without any model training, combining Grounding DINO's text-image conversion capability with custom samples; the output also carries a semantic tag for each recognized object, which markedly improves the practicality of general target detection models in applied fields.
4. The proposed model structure can be used in subdivision fields such as remote sensing imagery, improves the generalized segmentation capability of the SAM model, conveniently carries subdivision-field sample knowledge, and applies equally to other fields; it is highly flexible and extensible and can significantly improve the deployed effectiveness of general target detection models in specific domains.
Drawings
FIG. 1 is a schematic diagram of a small sample learning remote sensing image detection method based on a SAM model;
FIG. 2 is a user-added custom sample artwork according to the present invention.
FIG. 3 is a schematic diagram of a user-added custom sample according to the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar symbols indicate like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
As shown in fig. 1-3, the invention discloses a SAM-model-based method for small-sample learning remote sensing image detection, comprising the following steps:
S1, establishing an object sample library and extracting the sample labeling and semantic-tag features of each sample;
S2, the user inputs a remote sensing image and a detection-target text description to the detection model; the detection model preprocesses them with the Grounding DINO model to obtain the initial labeling result and initial semantic label of the detection target, and preprocesses the remote sensing image with the SAM model to obtain image features;
S3, the detection model uses the initial labeling result and initial semantic label of the detection target, together with the sample labelings and semantic labels, to extract comprehensive feature points, and superposes the image features obtained by the SAM model as a further prompt signal for SAM target detection, so that the extraction result is obtained under the guidance of the sample library.
The following describes key terms and English abbreviations used in the invention:
SAM model: the Segment Anything Model, a general model for image segmentation tasks. Compared with traditional segmentation models, the SAM model can extract the boundaries of image content from various input prompts and achieves better segmentation on scenes that are unseen by the model or relatively blurred.
Grounding DINO: a text-image cross-modal open-set target detection model that can generate corresponding image features and candidate detection boxes from arbitrary text.
The text-remote-sensing-image cross-modal detection model takes a general target detection model as its base and introduces a remote sensing sample library in plug-and-play fashion; the overall model structure is shown in fig. 1.
As shown in part B of fig. 1, a sample library of custom image samples with sample labelings and semantic tags, Sample_mask[1..N], may be predefined or uploaded by the user at run time. Each sample contains an original image I_R, a sample labeling M_R, and a semantic tag Label_R; sample-library feature extraction refers to extracting the sample labeling M_R and semantic tag Label_R features based on the original image I_R.
Feature extraction over the custom sample library Sample_mask[1..N] uses the SAM model's image encoder Φ_i-enc:

F_R^(r) = Φ_i-enc(I_R^(r)), r ∈ [1, N] (13)
For each pair of sample image and labeling in the sample library, the positional confidence distribution is computed: the sample labeling M_R is spatially multiplied with the sample image feature F_R to obtain the distribution similarity F̄_R; the cosine similarity between F̄_R and the image feature F then yields the probability S that each pixel of the input remote sensing image matches the target labeling.
The probability S can be computed over the image's multiple channels:

F̄_R = M_R ⊙ F_R (10)

S_i = cosine(F̄_R, F_i) (11)

S = (1/n) Σ_{i=1}^{n} S_i (12)

where i ∈ [1, n] indexes the image channels participating in the calculation and S is the mean result over the channels.
From the calculated probability S, the two points with the highest and lowest confidence, P_high and P_low, are found; they represent the most likely target point and the least likely (i.e., most likely background) point, respectively. The confidence at P_high can be required to exceed a specified threshold of 0.9; the two points P_high and P_low then serve as positive and negative prompt points that guide the SAM model to decide whether the detection target is present in the image. If the confidence at P_high is below the 0.9 threshold, the sample's target may be assumed absent from the input remote sensing image.
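The presence test just described amounts to a single threshold comparison; a minimal sketch follows, where the 0.9 value comes from the text and the function name is ours.

```python
def target_present(p_high_conf, threshold=0.9):
    """If the best confidence is below the threshold, assume the sample's
    target is absent from the input image (0.9 comes from the text)."""
    return p_high_conf >= threshold

print(target_present(0.95), target_present(0.42))  # → True False
```

In practice the threshold would likely be tuned per sample library rather than fixed.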
The two points P_high and P_low, as positive and negative prompt points of the SAM model, are input as point-type prompts to the prompt encoder Φ_p-enc to extract coding features:

T_prompt = Φ_p-enc(P_high, P_low) (5)

where Φ_p-enc is the prompt encoder of the SAM model and T_prompt is the comprehensive feature point, used in the next step to guide the SAM model to extract the final key result.
In use, as shown in part A of fig. 1, the user inputs a remote sensing image and a detection-target text description to the detection model, which preprocesses them with the Grounding DINO model to obtain the initial labeling result M_dino and initial semantic label Label_dino of the detection target.
Grounding DINO is an open-set target detector supporting text-image multimodal input. For a user's input pair of remote sensing image (image, I) and detection-target text description (text), it outputs multiple candidate regions, each carrying an initial labeling result M_dino^(i) and a corresponding semantic tag Label_dino^(i). Grounding DINO comprises three main components, an image encoder, a text encoder, and a decoder, and the process can be expressed as:

F_img-dino = Φ_dino-img-enc(I) (1)

F_text = Φ_dino-text-enc(text) (2)

(M_dino, Label_dino) = Φ_dino-dec(F_img-dino, F_text) (3)

where I is the remote sensing image input by the user, text is the detection-target text description, Φ_dino-img-enc is the image encoder of Grounding DINO with output F_img-dino, Φ_dino-text-enc is the text encoder with output F_text, Φ_dino-dec is the decoder, M_dino is the initial labeling result of the detection target, and Label_dino is its initial semantic label.
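The data flow of eqs. (1) to (3) can be sketched structurally as below. All three Grounding DINO components are stubs that only mimic the shapes of the interfaces; none of this is the real model or its API.

```python
import numpy as np

def dino_img_enc(image):                  # Φ_dino-img-enc (stub)
    return image.mean(axis=-1)            # toy "image feature" F_img-dino

def dino_text_enc(text):                  # Φ_dino-text-enc (stub)
    return np.array([float(len(w)) for w in text.split()])  # toy F_text

def dino_dec(f_img, f_text):              # Φ_dino-dec (stub)
    # a real decoder would cross-attend f_img and f_text; here we just
    # emit one fixed candidate box M_dino and label Label_dino
    return [(0, 0, 4, 4)], ["target"]

image = np.zeros((8, 8, 3))
m_dino, label_dino = dino_dec(dino_img_enc(image), dino_text_enc("football field"))
print(m_dino, label_dino)
```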
The image features are obtained by preprocessing the input remote sensing image with the SAM model, an interactive segmentation framework that generates segmentation results from a given prompt (e.g., foreground/background points, a region bounding box, or a mask) and comprises three main components: an image encoder, a prompt encoder, and a mask decoder.
The SAM model preprocesses the input remote sensing image as:

F_img = Φ_i-enc(I) (4)

where Φ_i-enc is the SAM model's pre-trained image encoder, I is the user-input remote sensing image, and F_img is the image feature. The SAM model processes the image into the intermediate image feature F_img using a Vision Transformer (ViT) based pre-trained image encoder Φ_i-enc.
Through this Grounding DINO and SAM preprocessing of the input remote sensing image I, the Grounding DINO candidate-region labelings, namely M_dino and Label_dino, are obtained, and the image feature F_img is obtained from the SAM model.
Further, as shown in part C of fig. 1, the preprocessed M_dino and Label_dino are combined with the sample features extracted from the sample library for comprehensive feature extraction, and the combined features are superposed as a further prompt signal for SAM target detection, finally yielding the extraction result under the guidance of the sample library, including the region of each recognized target object and its semantic label.
Specifically, given the multiple candidate regions M_dino obtained by the Grounding DINO model, the correlation between each candidate region's semantic tag Label_dino and the sample tag Label_R is computed, for which Chinese word-vector correlation can be used:

Label_M = argmax sim(Label_dino, Label_R) (6)

where sim is the Chinese word-vector correlation calculation.
According to Label_M, the M_dino of the most relevant candidate rectangular region above the acquisition threshold is extracted by the SAM model's prompt encoder to obtain T_dino:

T_dino = Φ_p-enc(M_dino) (7)
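The label matching of eq. (6) can be sketched with cosine similarity over word vectors; the embedding here is a toy lookup table standing in for a real Chinese word-vector model, and all names and vectors are invented for illustration.

```python
import numpy as np

def most_relevant_label(label_dino, library, embed):
    """Eq. (6): pick the library label whose word vector is most
    similar to the Grounding DINO label (sim = cosine similarity)."""
    q = embed(label_dino)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(library, key=lambda lab: cos(q, embed(lab)))

vecs = {"football field": np.array([1.0, 0.1]),   # toy embeddings
        "soccer pitch":   np.array([0.9, 0.2]),
        "storage tank":   np.array([0.0, 1.0])}
best = most_relevant_label("football field",
                           ["soccer pitch", "storage tank"],
                           lambda w: vecs[w])
print(best)  # → soccer pitch
```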
T dino is overlapped with T prompt after the point prompt information is encoded, and the point prompt information is jointly used as a prompt of a SAM model to be decoded by a decoder phi m-dec of the SAM model, so that the positioning accuracy of the SAM model detection can be further improved, and the process formula is as follows:
(8)
repeating the steps for N times, respectively calculating for all the provided N custom samples, and finally identifying all the detection target areas which possibly exist:
$M_{final} = \bigcup_{r=1}^{N} \Phi_{m\text{-}dec}\big(F_{img},\ T_{dino}^{(r)} \oplus T_{prompt}^{(r)}\big)$ (9)
wherein r ∈ [1, N], N is a positive integer equal to the number of custom samples, $T_{dino}^{(r)}$ is the value of T dino in the r-th iteration, and $T_{prompt}^{(r)}$ is the value of T prompt in the r-th iteration.
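The per-sample repetition of formula (9) can be sketched as a loop whose per-iteration masks are unioned. The detection function is a placeholder for one full pass of formula (8).

```python
import numpy as np

def detect_one_sample(f_img, r):
    """Placeholder for one pass of eq. (8) using the r-th sample's prompts."""
    mask = np.zeros(f_img.shape, dtype=bool)
    mask[r] = True  # pretend sample r detects its own row of the feature map
    return mask

f_img = np.zeros((8, 8))
n = 3  # number of custom samples N
per_sample = [detect_one_sample(f_img, r) for r in range(n)]
final = np.logical_or.reduce(per_sample)  # union over r = 1..N, eq. (9)
```

Because the N passes share F img and differ only in prompts, the loop is embarrassingly parallel in a real implementation.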
An application of the embodiment is described below:
referring to fig. 1, the user inputs the remote sensing image to be detected as the input remote sensing image, together with the detection-target text description "extract sports fields such as football fields and basketball fields in the image".
The detection model first performs extraction with Grounding Dino to obtain M dino, which comprises the initial labeling results of 3 candidate regions and their semantic tags: "football field", "basketball field" and "sports field".
The existing sample library contains 10 related detection targets, with semantic tags "football field", "basketball field", "plastic track", "building", "tree", "road", "storage tank", "airplane", "automobile" and "ship".
Through calculation, the semantic tags of the 3 candidate regions M dino are found to be most similar to the tags of the 3 target samples "football field", "basketball field" and "plastic track" in the sample library, so the maximum-probability feature points and background points corresponding to these 3 samples in the input picture are extracted respectively.
Then the 3 candidate regions in M dino are each combined with the corresponding feature points and background-point features (3 combinations in total) and used as prompt-information input to the SAM model, finally obtaining 3 target detection results under the guidance of the sample library, namely the "football field", "basketball field" and "plastic track" detection areas marked with wire frames in part C of fig. 1.
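Replaying this worked example as code: each Grounding Dino candidate label is matched to the closest sample-library tag. The similarity scores below are hypothetical numbers chosen to reproduce the matching described, not values from the patent.

```python
# Hypothetical similarity scores between candidate labels and library tags.
SIM = {
    "football field":   {"football field": 1.0, "basketball field": 0.3, "plastic track": 0.4},
    "basketball field": {"football field": 0.3, "basketball field": 1.0, "plastic track": 0.35},
    "sports field":     {"football field": 0.6, "basketball field": 0.55, "plastic track": 0.7},
}
LIBRARY = ["football field", "basketball field", "plastic track",
           "building", "tree", "road", "storage tank", "airplane", "automobile", "ship"]

def best_tag(candidate):
    """Pick the library tag most similar to a Grounding Dino candidate label."""
    scores = SIM[candidate]
    return max((t for t in LIBRARY if t in scores), key=scores.get)

matches = [best_tag(c) for c in ["football field", "basketball field", "sports field"]]
```

Note how the generic candidate "sports field" resolves to the more specific library tag "plastic track", which is the behavior the example attributes to sample-library guidance.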
The establishment and update of the sample library can include user-defined sample addition. As shown in fig. 2, when a user finds that the existing recognition types in the sample library cannot meet the detection requirement and wants to add a "roundabout intersection" recognition target, the user only needs to upload an image annotated with that detection target to expand the sample library; the detection model can then mark whether a similar detection target exists in a newly input remote sensing image, as shown in fig. 3.
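A minimal sketch of this user-defined sample addition: the library gains one entry per uploaded annotated image, with no model retraining. The field names and values are illustrative assumptions, not the patent's data schema.

```python
# Sample library keyed by semantic tag; each entry holds the user's annotation
# mask and the features extracted from it (toy values here).
sample_library = {
    "football field": {"mask": [[1, 1], [1, 0]], "features": [0.9, 0.1]},
}

def add_custom_sample(library, tag, mask, features):
    """Register a user-uploaded annotated sample so it guides future detections."""
    library[tag] = {"mask": mask, "features": features}

add_custom_sample(sample_library, "roundabout intersection",
                  mask=[[0, 1], [1, 0]], features=[0.2, 0.8])
```

Because detection is prompt-driven, extending recognition to a new class is a dictionary insert rather than a training run, which is the low-resource property the patent emphasizes.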
The invention discloses a method for realizing a text-customized image target detection model in the field of remote sensing images, which combines a general image target detection base model with new samples from the remote sensing field. A user can designate a detection target in any remote sensing image through natural language, and the model outputs the recognized object regions in the remote sensing image together with semantic labels. The general image target detection base model comprises the text-image feature conversion model Grounding Dino and the general target detection model SAM: the conversion features from the text modality to the image modality are obtained using Grounding Dino, preliminary candidate regions are obtained using the generalized segmentation capability of the SAM model, and a small amount of labeled sample knowledge of remote sensing images is preprocessed into feature samples that are combined with the SAM model's capability to recognize everything. On the one hand, this overcomes the shortcomings of applying the SAM model in the field of remote sensing geographic imagery: the added sample sub-library improves the recognition accuracy of the SAM base model on remote sensing images and attaches semantic labels to recognized objects. On the other hand, the text-modality input of the invention expands the descriptive capability for recognition objects and has strong structural extensibility, allowing a user to customize new samples and set special detection targets, realizing rapid adaptation to newly added recognition targets and expanding the application scenarios of the SAM model.
Meanwhile, in order to fully utilize the segment-everything recognition capability of the SAM model to improve the generalization of remote sensing image target detection, and to support users freely describing detection requirements in natural language, the small-sample adaptive model structure designed by the invention improves the segmentation results of the SAM model without any model retraining, combining the text-image conversion capability of Grounding Dino with user-defined samples. The output of the scheme also attaches a semantic label to each recognized object. The proposed model structure can utilize and improve the generalized segmentation capability of the SAM model in subdivided fields such as remote sensing imagery, better supports diversified user detection requirements, better fuses remote sensing sample knowledge, and achieves improved detection effects at low resource cost, with fast adaptive response and stronger generalization, significantly improving the intelligent experience of interactive remote sensing image query.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art according to the technical scheme of the present invention and its inventive concept, within the scope disclosed by the present invention, shall be covered by the protection scope of the present invention.
It is to be understood that the terms "center," "longitudinal," "transverse," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counter-clockwise," "axial," "radial," "circumferential," and the like are directional or positional relationships as indicated based on the drawings, merely to facilitate describing the invention and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be configured and operated in a particular orientation, and therefore should not be construed as limiting the invention.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally formed; may be mechanically connected, may be electrically connected or may be in communication with each other; either directly or indirectly, through intermediaries, or both, may be in communication with each other or in interaction with each other, unless expressly defined otherwise. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
In the present invention, unless expressly stated or limited otherwise, a first feature "up" or "down" a second feature may be the first and second features in direct contact, or the first and second features in indirect contact via an intervening medium. Moreover, a first feature being "above," "over" and "on" a second feature may be a first feature being directly above or obliquely above the second feature, or simply indicating that the first feature is level higher than the second feature. The first feature being "under", "below" and "beneath" the second feature may be the first feature being directly under or obliquely below the second feature, or simply indicating that the first feature is less level than the second feature.
Claims (5)
1. A SAM-model-based small-sample learning remote sensing image detection method, characterized by comprising the following steps:
s1, establishing an object sample library, and extracting sample labels and semantic tag features of each sample;
S2, inputting a remote sensing image and text description of a detection target to a detection model by a user, and preprocessing the remote sensing image and the text description by the detection model by using Grounding Dino model to obtain an initial labeling result and an initial semantic label of the detection target; preprocessing a remote sensing image by using the SAM model to obtain image characteristics;
s3, the detection model utilizes an initial labeling result and an initial semantic label of a detection target to be compared with a sample labeling and semantic label of a sample to extract comprehensive feature points, and then image features acquired by the SAM model are overlapped to serve as prompt signals for further target detection of the SAM model, so that the detection model acquires an extraction result under the guidance of a sample library;
in the step S3, the step of extracting the comprehensive feature points is as follows:
S3A1, performing spatial multiplication of the sample label M R and the sample image feature F R to obtain their distribution similarity $\bar{F}_R$; wherein the sample image feature F R is acquired by the image encoder Φ i-enc of the SAM model;
S3A2, performing a cosine similarity calculation between the distribution similarity $\bar{F}_R$ of the sample image and the image feature F img to obtain the probability S that each pixel of the remote sensing image conforms to the target mark;
S3A3, obtaining the highest confidence coefficient P high and the lowest confidence coefficient P low of each pixel of the input remote sensing image according to the probability S;
S3A4, extracting the comprehensive feature points by using the prompt encoder Φ p-enc of the SAM model according to the highest confidence coefficient P high and the lowest confidence coefficient P low, with the formula:
$T_{prompt} = \Phi_{p\text{-}enc}(P_{high},\ P_{low})$ (5)
wherein Φ p-enc is the prompt encoder of the SAM model, and T prompt is the comprehensive feature point;
and in the step S3, the steps of superposing image features as further prompt signals for SAM model target detection are as follows:
S3B1, in combination with the detection-target initial labeling results M dino obtained by the Grounding Dino model, obtaining the Lable M with the highest degree of correlation according to the correlation between the initial semantic tag Lable dino of each initial labeling result M dino and Lable R; the process formula for obtaining Lable M is:
$\mathrm{Lable}_M = \arg\max_{R}\ \mathrm{sim}(\mathrm{Lable}_{dino},\ \mathrm{Lable}_R)$ (6)
wherein sim is the Chinese word-vector correlation calculation, and Lable R is the semantic tag of a sample in the sample library;
S3B2, M dino is encoded by the prompt encoder of the SAM model to obtain T dino, with the process formula:
$T_{dino} = \Phi_{p\text{-}enc}(M_{dino})$ (7)
S3B3, T dino and the comprehensive feature point T prompt are superposed and jointly input as the prompt of the SAM model into the decoder Φ m-dec for decoding, improving the detection accuracy of the SAM model, with the process formula:
$M_r = \Phi_{m\text{-}dec}\big(F_{img},\ T_{dino} \oplus T_{prompt} \oplus T_{out}\big)$ (8)
wherein T out is a fixed prompt hyperparameter of the SAM model used for boundary extraction;
S3B4, repeating the steps S3B1 to S3B3 N times to obtain the final output of the detection model, with the process formula:
$M_{final} = \bigcup_{r=1}^{N} \Phi_{m\text{-}dec}\big(F_{img},\ T_{dino}^{(r)} \oplus T_{prompt}^{(r)} \oplus T_{out}\big)$ (9)
wherein r ∈ [1, N], N is a positive integer equal to the number of custom samples, $T_{dino}^{(r)}$ is the value of T dino in the r-th iteration, and $T_{prompt}^{(r)}$ is the value of T prompt in the r-th iteration.
2. The SAM model-based small sample learning remote sensing image detection method of claim 1, wherein: in the step S2, the Grounding Dino model is utilized to preprocess the input remote sensing image and the text description of the detection target, and the process formula is as follows:
$F_{img\text{-}dino} = \Phi_{dino\text{-}img\text{-}enc}(I)$ (1)
$F_{text} = \Phi_{dino\text{-}text\text{-}enc}(Text)$ (2)
$(M_{dino},\ \mathrm{Lable}_{dino}) = \Phi_{dino\text{-}dec}(F_{img\text{-}dino},\ F_{text})$ (3)
Wherein I is a remote sensing image input by a user, text is a detection target text description, Φ dino-img-enc represents an image encoder of Grounding Dino, F img-dino is an output quantity of Φ dino-img-enc, Φ dino-text-enc represents a text encoder of Grounding Dino, F text is an output quantity of Φ dino-text-enc, Φ dino-dec represents a decoder of Grounding Dino, M dino is a detection target initial labeling result, and Lable dino is a detection target initial semantic tag.
3. The SAM model-based small sample learning remote sensing image detection method of claim 2, wherein: the SAM model in S2 preprocesses the input remote sensing image, and the process formula is as follows:
$F_{img} = \Phi_{i\text{-}enc}(I)$ (4)
wherein Φ i-enc is the pre-trained image encoder of the SAM model, I is the user-input remote sensing image, and F img is the image feature.
4. The SAM model-based small sample learning remote sensing image detection method of claim 3, wherein: the probability S is obtained by calculating a plurality of channels of the image, and the process formula is as follows:
$\bar{F}_R^{\,i} = M_R \cdot F_R^{\,i}$ (10)
$S_i = \cos\!\big(\bar{F}_R^{\,i},\ F_{img}^{\,i}\big)$ (11)
$S = \dfrac{1}{n}\sum_{i=1}^{n} S_i$ (12)
wherein i ∈ [1, n] indexes the channels of the image, $\bar{F}_R$ is the distribution similarity, M R is the sample label, F R is the sample image feature, S i is the result value of the i-th channel, and S is the mean result over the channels.
5. The SAM model-based small sample learning remote sensing image detection method of claim 1, wherein: and in the S1, an object sample library mode is established, wherein the mode comprises user-defined sample adding.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410154598.3A CN117690031B (en) | 2024-02-04 | 2024-02-04 | SAM model-based small sample learning remote sensing image detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117690031A CN117690031A (en) | 2024-03-12 |
CN117690031B true CN117690031B (en) | 2024-04-26 |
Family
ID=90130410
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117690031B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118097289B (en) * | 2024-03-15 | 2024-09-20 | 华南理工大学 | Open world target detection method based on visual large model enhancement |
CN117952993B (en) * | 2024-03-27 | 2024-06-18 | 中国海洋大学 | Semi-supervised medical image segmentation method based on image text cooperative constraint |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023077816A1 (en) * | 2021-11-03 | 2023-05-11 | 中国华能集团清洁能源技术研究院有限公司 | Boundary-optimized remote sensing image semantic segmentation method and apparatus, and device and medium |
CN116775922A (en) * | 2023-05-16 | 2023-09-19 | 中国航空综合技术研究所 | Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics |
CN116824307A (en) * | 2023-08-29 | 2023-09-29 | 深圳市万物云科技有限公司 | Image labeling method and device based on SAM model and related medium |
CN116912663A (en) * | 2023-07-20 | 2023-10-20 | 同济大学 | Text-image detection method based on multi-granularity decoder |
CN116994140A (en) * | 2023-08-14 | 2023-11-03 | 航天宏图信息技术股份有限公司 | Cultivated land extraction method, device, equipment and medium based on remote sensing image |
CN117197609A (en) * | 2023-08-25 | 2023-12-08 | 中国自然资源航空物探遥感中心 | Construction method, system, medium and equipment of remote sensing sample data set |
CN117275025A (en) * | 2023-11-01 | 2023-12-22 | 北京道仪数慧科技有限公司 | Processing system for batch image annotation |
Non-Patent Citations (2)
Title |
---|
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection; Shilong Liu, et al.; arXiv:2303.05499v4 [cs.CV]; 2023-03-20; full text *
A survey of deep-learning-based natural scene text detection and recognition; Wang Jianxin et al.; Journal of Software; 2020-05-15 (05); full text *
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||