CN116091787B - Small sample target detection method based on feature filtering and feature alignment - Google Patents

Small sample target detection method based on feature filtering and feature alignment

Info

Publication number
CN116091787B
CN116091787B (application CN202211228411.7A)
Authority
CN
China
Prior art keywords
feature
feature map
layer
convolution
support set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211228411.7A
Other languages
Chinese (zh)
Other versions
CN116091787A (en)
Inventor
王勇
杨亮
张彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202211228411.7A
Publication of CN116091787A
Application granted
Publication of CN116091787B
Legal status: Active

Classifications

    • G06V 10/443: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components, by matching or filtering
    • G06N 3/08: Learning methods (computing arrangements based on biological models; neural networks)
    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/765: Classification using rules for classification or partitioning the feature space
    • G06V 10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; mappings, e.g. subspace methods
    • G06V 10/806: Fusion of extracted features, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/82: Arrangements for image or video recognition or understanding using neural networks
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a small sample target detection method based on feature filtering and feature alignment. A small sample target detection model is built: the generated query set and support set features are encoded by different convolutions into key and value feature pairs, which are fed to a feature filtering module with spatial cross attention and channel cross attention to obtain features filtered by the dual cross attention. The filtered features are spliced and input to an RPN network to generate candidate boxes; ROIAlign is applied to the candidate boxes and the features to obtain candidate features, which are fused with the support set features to generate deformable convolution kernels; the deformable convolution kernels are used to correct and align the candidate features, and finally the class probability and the fine box regression of each candidate feature are output. Compared with existing methods, the method provides more accurate detection when samples are scarce, is suitable for target detection in a variety of complex scenes, and has stronger generalization and better detection performance.

Description

Small sample target detection method based on feature filtering and feature alignment
Technical Field
The invention relates to the technical field of computer vision, in particular to a small sample target detection method based on feature filtering and feature alignment.
Background
Object detection is an important branch of computer vision and one of its most practically useful directions. Its goal is to find specific objects in an image and give their categories and positions, and it is widely applied in scenes such as military use, security, surveillance and autonomous driving. Target detection models with good performance are currently designed mainly on the basis of deep learning and, like other deep learning methods, deep-learning-based target detection generally needs a large dataset for training; in many fields and scenes, however, acquiring a large amount of data is very difficult.
The framework adopted by existing small sample target detection techniques is mainly divided into two stages: extraction of candidate boxes, followed by fine regression and classification of the boxes. Patent CN113221987A discloses a small sample target detection method based on a cross attention mechanism, in which a small sample target detection model is constructed with ResNet-50 as the backbone network and a cross attention module is used for feature fusion and enhancement; however, that method uses multi-head attention from natural language processing and only performs binary classification of a single-class target in the classification stage, so its efficiency is low. Hanzhe Hu et al. propose dense relation distillation with context-aware aggregation for few-shot object detection, in which a dense relation distillation module likewise uses cross-attention to mine the degree of association between two data sets (or two feature maps) along a certain measurement dimension so that the model can strengthen the features in those dimensions; however, that method ignores the association between the support set and the query set in the channel dimension, the diversity of the filtered features is insufficient, and the accuracy of target detection therefore still needs to be improved.
Interpretation of some terms used in the specification: stemConv is the base (stem) convolution layer; layer1, layer2, layer3 and layer4 are the residual block layers in ResNet.
Disclosure of Invention
In order to solve the above technical problems, the invention discloses a small sample target detection method based on feature filtering and feature alignment, which improves target detection accuracy when samples are scarce.
A small sample target detection method based on feature filtering and feature alignment comprises the following steps:
S1: establishing a feature generation network constructed from a multi-level residual module and a feature fusion module, sharing the backbone weights of the feature generation network, and performing feature generation on the support set data and the query set data respectively to obtain a class prototype of each class of data in the support set and five-level context aggregation feature maps of the query set;
S2: establishing a dual cross-attention feature filtering module and using it to filter the context aggregation features of the support set, retaining the support set features most strongly associated with the five-level context aggregation feature maps of the query set and filtering out weakly correlated features, thereby enhancing the model's ability to perceive the targets to be identified in the query set and obtaining five filtered feature maps corresponding to the five-level context aggregation feature maps of the query set;
S3: inputting the filtered feature maps into a candidate box suggestion network, which generates candidate boxes mapped back to the original image from the features of each level;
S4: extracting candidate features from the filtered feature maps of the corresponding levels by using the generated candidate boxes to obtain a candidate-box feature maps, where a is greater than or equal to 1000 and preferably a = 2000; establishing a deformable feature alignment module; fusing each candidate-box feature map with the support set class prototype set feature map to generate a deformable convolution kernel corresponding to that candidate-box feature map; using the deformable convolution kernel to correct and align each candidate-box feature map; and finally outputting the class probability corresponding to each candidate-box feature map and the fine regression of its box.
Further, the multi-level residual module consists of five levels: stemConv, layer1, layer2, layer3 and layer4; the feature fusion enhancement module is formed by combining four 1×1 convolution layers and four 3×3 convolution layers.
The stemConv level contains a convolution layer with a 7×7 convolution kernel, and this 7×7 convolution layer is implemented differently for extracting the query set and support set features: for the query set, the number of input channels of the convolution layer is 3, corresponding to the three RGB color channels; for the support set, the number of input channels of the convolution layer is 4, corresponding to the three color channels and one mask channel. The stemConv level also includes a BatchNorm2d layer, a ReLU activation function layer and a MaxPool2d layer.
Layer1, layer2, layer3 and layer4 each consist of a plurality of residual blocks.
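As a concrete illustration of the stemConv level described above, the following PyTorch sketch stacks the 7×7 convolution, BatchNorm2d, ReLU and MaxPool2d layers with a 3-channel query branch and a 4-channel support branch. The 64 output channels, stride and padding are assumptions borrowed from the standard ResNet stem rather than values stated in the text.

```python
import torch.nn as nn

class StemConv(nn.Module):
    """stemConv level: 7x7 conv + BatchNorm2d + ReLU + MaxPool2d."""
    def __init__(self, in_channels: int):
        super().__init__()
        # in_channels = 3 for query images (RGB); 4 for support images (RGB + mask)
        self.conv = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.pool(self.relu(self.bn(self.conv(x))))

query_stem = StemConv(in_channels=3)    # query branch: three RGB channels
support_stem = StemConv(in_channels=4)  # support branch: RGB plus one mask channel
```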
Further, the class prototype of each class of data in the support set and the five-level context aggregation feature maps of the query set are obtained as follows:
S1-1: a query set picture or a support set picture is input to the stemConv level to generate a first feature map C1; the first feature map C1 passes through layer1 to generate a second feature map C2 with the same width and height and 256 channels; the three subsequent layers then successively double the channel count and halve the width and height of the feature map generated by the previous layer, generating a third feature map C3, a fourth feature map C4 and a fifth feature map C5 in turn;
S1-2: a 1×1 convolution is applied to each of the second feature map C2, the third feature map C3, the fourth feature map C4 and the fifth feature map C5 to perform a channel transformation, generating a second intermediate feature map M2, a third intermediate feature map M3, a fourth intermediate feature map M4 and a fifth intermediate feature map M5, each with 256 channels;
S1-3: the fifth intermediate feature map M5 is convolved with a 3×3 kernel to generate a fifth context aggregation feature map P5, and P5 is downsampled to generate a sixth context aggregation feature map P6; the upsampled fifth intermediate feature map M5 is added to the fourth intermediate feature map M4 and passed through a 3×3 convolution to generate a fourth context aggregation feature map P4; in the same way, the upsampled fourth intermediate feature map M4 is fused with the third intermediate feature map M3 and the upsampled third intermediate feature map M3 is fused with the second intermediate feature map M2 to generate a third context aggregation feature map P3 and a second context aggregation feature map P2 respectively;
S1-4: the second context aggregation feature map P2, the third context aggregation feature map P3, the fourth context aggregation feature map P4, the fifth context aggregation feature map P5 and the sixth context aggregation feature map P6 are taken as the five-level context aggregation feature maps of the query set; for the support set, the third context aggregation feature map P3 is taken as the class prototype of each class of data in the support set.
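A minimal PyTorch sketch of steps S1-2 to S1-4 follows. Only the lateral 1×1 convolutions, the 3×3 output convolutions, the top-down addition and the P2 to P6 outputs come from the text; the C2 to C5 channel widths (256/512/1024/2048), the nearest-neighbour upsampling and the strided pooling used for P6 are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Feature fusion module: builds P2-P6 from backbone maps C2-C5."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # four 1x1 lateral convolutions and four 3x3 output convolutions
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.output = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        m2, m3, m4, m5 = [lat(c) for lat, c in zip(self.lateral, (c2, c3, c4, c5))]
        # top-down pathway: upsample the higher level and add it to the lateral map
        m4 = m4 + F.interpolate(m5, size=m4.shape[-2:], mode="nearest")
        m3 = m3 + F.interpolate(m4, size=m3.shape[-2:], mode="nearest")
        m2 = m2 + F.interpolate(m3, size=m2.shape[-2:], mode="nearest")
        p2, p3, p4, p5 = [out(m) for out, m in zip(self.output, (m2, m3, m4, m5))]
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)  # downsample P5 to obtain P6
        return p2, p3, p4, p5, p6
```

For a query image all five outputs would be used downstream, while for a support image only the P3 output would be kept as the class prototype.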
Further, the dual cross-attention feature filtering module mainly consists of a feature key-value-pair generation layer, a spatial filter generation layer, a channel filter generation layer and a feature filtering and splicing layer. The feature key-value-pair generation layer consists of three query transformation convolution layers and four support transformation convolution layers; the spatial filter generation layer consists of an initial spatial filter generation layer F_spatial and a softmax layer; the channel filter generation layer consists of an initial channel filter generation layer F_channel and a softmax layer; the feature filtering and splicing layer consists of a filtering layer and a splicing layer.
Further, the filtered feature maps are obtained as follows:
S2-1: the context aggregation feature map of each level of the query set is input into the query transformation convolution layers of the feature key-value-pair generation layer for convolution transformation, so that the context aggregation feature map of each level generates a query set first feature key k_1^q, a query set second feature key k_2^q and a query set first feature value v^q, each of size C×H×W, where C is the number of channels, H is the feature height and W is the feature width. In the support transformation convolution layers, the class prototypes of each class of data in the support set are first spliced into a support set class prototype set feature map of size N×C×H×W, where N is the number of classes in the support set; convolution transformation of the spliced support set class prototype set feature map then generates a support set first feature key k_1^s, a support set second feature key k_2^s, a support set first feature value v_1^s and a support set second feature value v_2^s, each of size N×C×H×W.
S2-2: in the spatial filter generation layer, an initial spatial filter F^{spatial} is first obtained by the matrix operation
F^{spatial}_{i,j} = \theta(k_1^q)_i \cdot \phi(k_1^s)_j
where i and j are the spatial position indices of the query set and the support set respectively, and \theta and \phi are linear transformation functions that transform the corresponding features into sizes and shapes suitable for the matrix operation; the spatial filter feature f^{spatial} is then obtained by applying the following softmax operation to the initial spatial filter:
f^{spatial}_{i,j} = \frac{\exp(F^{spatial}_{i,j})}{\sum_{j}\exp(F^{spatial}_{i,j})}
i.e. the value at each spatial position index is calculated according to the softmax algorithm.
S2-3: in the channel filter generation layer, an initial channel filter F^{channel} is first obtained by the matrix operation
F^{channel}_{i,j} = \theta(k_2^q)_i \cdot \phi(k_2^s)_j
where i and j are the channel indices of the query set and the support set respectively, and \theta and \phi are again linear transformation functions that transform the corresponding features into sizes and shapes suitable for the matrix operation; the channel filter feature f^{channel} is then obtained by a further softmax operation:
f^{channel}_{i,j} = \frac{\exp(F^{channel}_{i,j})}{\sum_{j}\exp(F^{channel}_{i,j})}
i.e. the value at each channel index is calculated according to the softmax algorithm.
S2-4: the support set first feature value v_1^s is filtered with the spatial filter feature f^{spatial} and the support set second feature value v_2^s is filtered with the channel filter feature f^{channel}, giving a spatial cross-filtered feature map and a channel cross-filtered feature map respectively; these are then spliced with the query set first feature value v^q according to
\tilde{P} = \mathrm{concat}(v^q, f^{spatial} \cdot v_1^s, f^{channel} \cdot v_2^s)
to obtain the filtered feature map \tilde{P} corresponding to the context aggregation feature map of each level of the query set.
Further, the candidate region suggestion network consists of a confidence convolution layer and a box regression convolution layer, both with 1×1 convolution kernels; the number of output channels of the confidence convolution layer is the number of anchors, and the number of output channels of the box regression convolution layer is 4 times the number of anchors. For each feature element point of each filtered feature map, three candidate boxes of different scales are predicted according to the transverse or longitudinal proportion of the target, i.e. the set number of anchors is 3.
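A sketch of the two 1×1 convolution heads described here is given below; the 256 input channels are an assumption, and the regression branch outputs four values per anchor as in the embodiment.

```python
import torch.nn as nn

class ProposalHead(nn.Module):
    """Candidate box suggestion network head: confidence + box regression, both 1x1 convs."""
    def __init__(self, in_channels: int = 256, num_anchors: int = 3):
        super().__init__()
        self.confidence = nn.Conv2d(in_channels, num_anchors, kernel_size=1)     # objectness per anchor
        self.box_delta = nn.Conv2d(in_channels, 4 * num_anchors, kernel_size=1)  # (dx, dy, dw, dh) per anchor

    def forward(self, filtered_feature):
        return self.confidence(filtered_feature), self.box_delta(filtered_feature)
```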
Further, the class probability corresponding to each candidate-box feature map and the fine regression of its box are obtained as follows:
S4-1: ROIAlign is applied to each candidate box, i.e. feature points are obtained on the corresponding filtered feature map according to the sampled candidate box coordinates and interpolated with a bilinear interpolation algorithm, giving a candidate-box feature map of size 7×7 with 256 channels for each candidate box;
S4-2: the candidate-box feature map obtained from each ROIAlign is spliced with the support set class prototype set feature map, and a 1×1 convolution is applied to the spliced feature map to generate an offset parameter map with 18 channels; the parameters of the 18 channels represent the sampling offsets of a 3×3 deformable convolution kernel, each of whose nine sampling points has an offset in the x direction and the y direction, giving the 18 output offset parameters;
S4-3: after the deformable convolution kernel is obtained, it is used to perform deformed sampling on the 2000 candidate-box feature maps, giving an aligned and corrected candidate-box feature map for each candidate box whose size and channel count are consistent with those of the original candidate-box feature map;
S4-4: each aligned and corrected candidate-box feature map is flattened into a feature vector of length 12544 and passed through two fully connected layers, each using ReLU as its activation function, finally giving 2000 activated feature vectors of length 1024; these are passed through a class fully connected layer and a box regression fully connected layer to obtain the class probability and the precise box regression values respectively, finally giving, for each candidate box whose confidence exceeds a preset threshold, the precise box together with the corresponding target class and confidence.
The invention has the beneficial effects that:
1. Compared with the prior art, the model obtains, from the image to be identified and a small number of limited training samples, the key features strongly related to the task through the two steps of feature filtering and feature alignment; this strengthens the data representation and learning capability of the target detection model and achieves accurate target detection when samples are scarce.
Drawings
FIG. 1 is a block diagram of the overall network model structure of the small sample target detection method based on feature filtering and feature alignment in an embodiment of the present invention.
FIG. 2 is a diagram of the feature extraction network in an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are further described below with reference to the accompanying drawings and examples. It should be noted that the examples do not limit the scope of the invention as claimed.
Example 1
As shown in FIG. 1 and FIG. 2, a small sample target detection method based on feature filtering and feature alignment comprises the following steps:
S1: a feature generation network constructed from a multi-level residual module and a feature fusion module is established; the backbone weights of the feature generation network are shared, and feature generation is performed on the support set data and the query set data respectively to obtain a class prototype of each class of data in the support set and five-level context aggregation feature maps of the query set. The specific implementation is as follows:
The multi-level residual module consists of five levels: stemConv, layer1, layer2, layer3 and layer4. The stemConv level contains a convolution layer with a 7×7 kernel that is implemented differently for extracting the query set and support set features: for the query set, the convolution layer has 3 input channels corresponding to the three RGB color channels; for the support set it has 4 input channels, corresponding to the three color channels and one mask channel. The stemConv level further comprises a BatchNorm2d layer, a ReLU activation function layer and a MaxPool2d layer, and each of layer1, layer2, layer3 and layer4 consists of a plurality of residual blocks. The original query set picture or support set picture is input to the stemConv level to generate a first feature map C1; C1 then passes through layer1 to generate a second feature map C2 with the same width and height and 256 channels; the three subsequent layers successively double the channel count and halve the width and height of the feature map produced by the previous layer, generating a third feature map C3, a fourth feature map C4 and a fifth feature map C5 in turn.
The feature fusion enhancement module is formed by combining four 1×1 convolution layers and four 3×3 convolution layers. The second feature map C2, third feature map C3, fourth feature map C4 and fifth feature map C5 corresponding to the four residual layers are taken out and each passed through a 1×1 convolution for a channel transformation, generating a second intermediate feature map M2, a third intermediate feature map M3, a fourth intermediate feature map M4 and a fifth intermediate feature map M5, all with 256 channels. The fifth intermediate feature map M5 is convolved with a 3×3 kernel to generate a fifth context aggregation feature map P5, and P5 is downsampled to generate a sixth context aggregation feature map P6; the upsampled M5 is added to the fourth intermediate feature map M4 and passed through a 3×3 convolution to generate a fourth context aggregation feature map P4; in the same way, the upsampled M4 is fused with the third intermediate feature map M3 to generate a third context aggregation feature map P3, and the upsampled M3 is fused with the second intermediate feature map M2 to generate a second context aggregation feature map P2. For the query set, the features of all five levels of context aggregation feature maps are used in the subsequent steps. For the support set, the second context aggregation feature map P2 is only an aggregation of the original image through one stemConv and one residual level, so its feature expressiveness is weak; the fourth and fifth context aggregation feature maps P4 and P5 are too small after the repeated width-height halving and carry little spatial position information; and the sixth context aggregation feature map P6 is not further aggregated and is smaller than P5. The third context aggregation feature map P3 of each class of data in the support set is therefore selected, i.e. it is taken as the class prototype of each class of data in the support set and used in the subsequent steps.
S2: a dual cross-attention feature filtering module is established and used to filter the context aggregation features of the support set, retaining the support set features most strongly associated with the five-level context aggregation feature maps of the query set and filtering out weakly correlated features, thereby enhancing the model's ability to perceive the targets to be identified in the query set and yielding five filtered feature maps corresponding to the five-level context aggregation feature maps of the query set. The specific implementation is as follows:
A dual cross-attention feature filtering module is established; it mainly consists of a feature key-value-pair generation layer, a spatial filter generation layer, a channel filter generation layer and a feature filtering and splicing layer. The feature key-value-pair generation layer consists of three query transformation convolution layers and four support transformation convolution layers; the spatial filter generation layer consists of an initial spatial filter generation layer F_spatial and a softmax layer; the channel filter generation layer consists of an initial channel filter generation layer F_channel and a softmax layer; the feature filtering and splicing layer consists of a filtering layer and a splicing layer.
S2-1: the context aggregation feature map of each level of the query set is input into the query transformation convolution layers of the feature key-value-pair generation layer for convolution transformation, so that the context aggregation feature map of each level generates a query set first feature key k_1^q, a query set second feature key k_2^q and a query set first feature value v^q, each of size C×H×W, where C is the number of channels, H is the feature height and W is the feature width. In the support transformation convolution layers, the class prototypes of each class of data in the support set are first spliced into a support set class prototype set feature map of size N×C×H×W, where N is the number of classes in the support set; convolution transformation of the spliced support set class prototype set feature map then generates a support set first feature key k_1^s, a support set second feature key k_2^s, a support set first feature value v_1^s and a support set second feature value v_2^s, each of size N×C×H×W.
S2-2: in the spatial filter generation layer, an initial spatial filter F^{spatial} is first obtained by the matrix operation
F^{spatial}_{i,j} = \theta(k_1^q)_i \cdot \phi(k_1^s)_j
where i and j are the spatial position indices of the query set and the support set respectively, and \theta and \phi are linear transformation functions that transform the corresponding features into sizes and shapes suitable for the matrix operation; the spatial filter feature f^{spatial} is then obtained by applying the following softmax operation to the initial spatial filter:
f^{spatial}_{i,j} = \frac{\exp(F^{spatial}_{i,j})}{\sum_{j}\exp(F^{spatial}_{i,j})}
i.e. the value at each spatial position index is calculated according to the softmax algorithm.
S2-3: in the channel filter generation layer, an initial channel filter F^{channel} is first obtained by the matrix operation
F^{channel}_{i,j} = \theta(k_2^q)_i \cdot \phi(k_2^s)_j
where i and j are the channel indices of the query set and the support set respectively, and \theta and \phi are again linear transformation functions that transform the corresponding features into sizes and shapes suitable for the matrix operation; the channel filter feature f^{channel} is then obtained by a further softmax operation:
f^{channel}_{i,j} = \frac{\exp(F^{channel}_{i,j})}{\sum_{j}\exp(F^{channel}_{i,j})}
i.e. the value at each channel index is calculated according to the softmax algorithm.
S2-4: the support set first feature value v_1^s is filtered with the spatial filter feature f^{spatial} and the support set second feature value v_2^s is filtered with the channel filter feature f^{channel}, giving a spatial cross-filtered feature map and a channel cross-filtered feature map respectively; these are then spliced with the query set first feature value v^q according to
\tilde{P} = \mathrm{concat}(v^q, f^{spatial} \cdot v_1^s, f^{channel} \cdot v_2^s)
to obtain the filtered feature map \tilde{P} corresponding to the context aggregation feature map of each level of the query set.
S3: the filtered feature maps are input into the candidate box suggestion network, which generates candidate boxes mapped back to the original image from the features of each level. The specific implementation is as follows:
The features of the five filtered feature maps obtained in step S2 are input into the candidate region suggestion network. The candidate region suggestion network consists of a confidence convolution layer and a box regression convolution layer, both with 1×1 convolution kernels; the number of output channels of the confidence convolution layer is the number of anchors, and the number of output channels of the box regression convolution layer is 4 times the number of anchors. For each feature element point of each filtered feature map, three candidate boxes of different scales are predicted according to the transverse or longitudinal proportion of the target, i.e. the set number of anchors is 3. Each feature point is mapped back to the corresponding scale of the original image to generate candidate boxes; the candidate boxes are sorted by the anchor confidence produced by the confidence convolution layer so that the boxes most likely to contain real targets come first, and non-maximum suppression is used to select the 2000 most suitable candidate boxes.
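A small sketch of this selection step using torchvision's nms is given below; the IoU threshold of 0.7 is an assumed value, while the sort-by-confidence and keep-2000 behaviour follows the text.

```python
import torch
from torchvision.ops import nms

def select_proposals(boxes: torch.Tensor, scores: torch.Tensor,
                     iou_thresh: float = 0.7, top_k: int = 2000) -> torch.Tensor:
    # boxes: (M, 4) candidate boxes mapped back to the original image; scores: (M,) anchor confidences
    order = scores.argsort(descending=True)      # most likely real boxes first
    boxes, scores = boxes[order], scores[order]
    keep = nms(boxes, scores, iou_thresh)        # non-maximum suppression
    return boxes[keep[:top_k]]                   # the 2000 most suitable candidate boxes
```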
S4: the generated candidate boxes are used to extract candidate features from the filtered feature maps of the corresponding levels, obtaining 2000 candidate-box feature maps; a deformable feature alignment module is established; each candidate-box feature map is fused with the support set class prototype set feature map to generate a deformable convolution kernel corresponding to that candidate-box feature map; the deformable convolution kernel is used to correct and align each candidate-box feature map; and finally the class probability corresponding to each candidate-box feature map and the fine regression of its box are output. The specific implementation is as follows:
S4-1: ROIAlign is applied to each candidate box, i.e. feature points are obtained on the corresponding filtered feature map according to the sampled candidate box coordinates and interpolated with a bilinear interpolation algorithm, giving a candidate-box feature map of size 7×7 with 256 channels for each candidate box;
S4-2: the candidate-box feature map obtained from each ROIAlign is spliced with the support set class prototype set feature map, and a 1×1 convolution is applied to the spliced feature map to generate an offset parameter map with 18 channels; the parameters of the 18 channels represent the sampling offsets of a 3×3 deformable convolution kernel, each of whose nine sampling points has an offset in the x direction and the y direction, giving the 18 output offset parameters;
S4-3: after the deformable convolution kernel is obtained, it is used to perform deformed sampling on the 2000 candidate-box feature maps, giving an aligned and corrected candidate-box feature map for each candidate box whose size and channel count are consistent with those of the original candidate-box feature map;
S4-4: each aligned and corrected candidate-box feature map is flattened into a feature vector of length 12544 and passed through two fully connected layers, each using ReLU as its activation function, finally giving 2000 activated feature vectors of length 1024; these are passed through a class fully connected layer and a box regression fully connected layer to obtain the class probability and the precise box regression values respectively, finally giving the precise box corresponding to each candidate box together with the corresponding target class and confidence.
A conventional deformable convolution generally generates its sampling offsets from the feature map itself, so the features are aligned using only their own information. In this method, the query set candidate features are spliced with the support set features before the offsets are generated, so the way in which the feature map should be deformed and aligned is derived from the relation between the query set and the support set; correcting and aligning the query set features with such a deformable convolution kernel therefore yields a better feature representation.
The invention was trained, tested and verified on both the public VOC dataset and an actual battlefield dataset; the partition of the VOC dataset is shown in Table 1 and the results are shown in Table 2.
Table 1: partitioning of VOC data sets
Table 2: results of the invention on the Pascal VOC dataset, averaged over multiple random experiments (the values are the average over the three dataset partitions, expressed as mAP at an IoU greater than 0.5):
The distribution and results on the actual battlefield dataset are shown in table 3:
Table 3
As can be seen from the results in Tables 2 and 3, compared with other models the model of the invention strengthens the query set's ability to retrieve relevant information from the support set and retains the features with higher correlation, so it can learn the data features more fully on a limited dataset and improves the average detection accuracy.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered by the scope of the claims of the present invention.

Claims (3)

1. A small sample target detection method based on feature filtering and feature alignment, characterized by comprising the following steps:
S1: establishing a feature generation network constructed from a multi-level residual module and a feature fusion module, sharing the backbone weights of the feature generation network, and performing feature generation on the support set data and the query set data respectively to obtain a class prototype of each class of data in the support set and five-level context aggregation feature maps of the query set;
S2: establishing a dual cross-attention feature filtering module, and filtering the class prototype of each class of data in the support set by using the dual cross-attention feature filtering module to obtain five filtered feature maps corresponding to the five-level context aggregation feature maps of the query set;
S3: inputting the filtered feature maps into a candidate box suggestion network, which generates candidate boxes mapped back to the original image from the features of each level;
S4: extracting candidate features from the filtered feature maps of the corresponding levels by using the generated candidate boxes to obtain a candidate-box feature maps, wherein a is greater than or equal to 1000; establishing a deformable feature alignment module; fusing each candidate-box feature map with the set feature map of the class prototypes of each class of data in the support set to generate a deformable convolution kernel corresponding to that candidate-box feature map; correcting and aligning each candidate-box feature map by using the deformable convolution kernel; and finally outputting the class probability corresponding to each candidate-box feature map and the fine regression of its box respectively;
The multi-level residual module consists of five levels: stemConv, layer1, layer2, layer3 and layer4; the feature fusion enhancement module is formed by combining four 1×1 convolution layers and four 3×3 convolution layers;
The stemConv level contains a convolution layer with a 7×7 convolution kernel, and this 7×7 convolution layer is implemented differently for extracting the query set and support set features: for the query set, the number of input channels of the convolution layer is 3, corresponding to the three RGB color channels; for the support set, the number of input channels of the convolution layer is 4, corresponding to the three color channels and one mask channel; the stemConv level also includes a BatchNorm2d layer, a ReLU activation function layer and a MaxPool2d layer;
Each layer1, layer2, layer3 and layer4 is composed of a plurality of residual blocks;
The class prototype of each class of data in the support set and the five-level context aggregation feature maps of the query set are obtained as follows:
S1-1: inputting a query set picture or a support set picture to the stemConv level to generate a first feature map C1; passing the first feature map C1 through layer1 to generate a second feature map C2 with the same width and height and 256 channels; the three subsequent layers successively doubling the channel count and halving the width and height of the feature map generated by the previous layer, generating a third feature map C3, a fourth feature map C4 and a fifth feature map C5 in turn;
S1-2: applying a 1×1 convolution to each of the second feature map C2, the third feature map C3, the fourth feature map C4 and the fifth feature map C5 to perform a channel transformation, generating a second intermediate feature map M2, a third intermediate feature map M3, a fourth intermediate feature map M4 and a fifth intermediate feature map M5, each with 256 channels;
S1-3: convolving the fifth intermediate feature map M5 with a 3×3 kernel to generate a fifth context aggregation feature map P5, and downsampling P5 to generate a sixth context aggregation feature map P6; adding the upsampled fifth intermediate feature map M5 to the fourth intermediate feature map M4 and applying a 3×3 convolution to generate a fourth context aggregation feature map P4; fusing the upsampled fourth intermediate feature map M4 with the third intermediate feature map M3 and the upsampled third intermediate feature map M3 with the second intermediate feature map M2 to generate a third context aggregation feature map P3 and a second context aggregation feature map P2 respectively;
S1-4: taking the second context aggregation feature map P2, the third context aggregation feature map P3, the fourth context aggregation feature map P4, the fifth context aggregation feature map P5 and the sixth context aggregation feature map P6 as the five-level context aggregation feature maps of the query set; for the support set, taking the third context aggregation feature map P3 as the class prototype of each class of data in the support set;
the dual cross-attention feature filtering module consists of a feature key-value-pair generation layer, a spatial filter generation layer, a channel filter generation layer and a feature filtering and splicing layer; the feature key-value-pair generation layer consists of three query transformation convolution layers and four support transformation convolution layers; the spatial filter generation layer consists of an initial spatial filter generation layer F_spatial and a softmax layer; the channel filter generation layer consists of an initial channel filter generation layer F_channel and a softmax layer; the feature filtering and splicing layer consists of a filtering layer and a splicing layer;
the filtered feature maps are obtained as follows:
S2-1: respectively inputting the context aggregation feature map of each level of the query set into the query transformation convolution layers of the feature key-value-pair generation layer for convolution transformation, the context aggregation feature map of each level generating a query set first feature key k_1^q, a query set second feature key k_2^q and a query set first feature value v^q, each of size C×H×W, wherein C is the number of channels, H is the feature height and W is the feature width;
in the support transformation convolution layers, first splicing the class prototypes of each class of data in the support set to obtain a support set class prototype set feature map of size N×C×H×W, wherein N is the number of classes in the support set, and then performing convolution transformation on the spliced support set class prototype set feature map to generate a support set first feature key k_1^s, a support set second feature key k_2^s, a support set first feature value v_1^s and a support set second feature value v_2^s, each of size N×C×H×W;
S2-2: in the spatial filter generation layer, first obtaining an initial spatial filter F^{spatial} by the matrix operation
F^{spatial}_{i,j} = \theta(k_1^q)_i \cdot \phi(k_1^s)_j
wherein i and j are the spatial position indices of the query set and the support set respectively, and \theta and \phi are linear transformation functions that transform the corresponding features into sizes and shapes suitable for the matrix operation; then obtaining the spatial filter feature f^{spatial} by applying a softmax operation to the initial spatial filter,
f^{spatial}_{i,j} = \frac{\exp(F^{spatial}_{i,j})}{\sum_{j}\exp(F^{spatial}_{i,j})}
wherein the value at each spatial position index is calculated according to the softmax algorithm;
S2-3: in the channel filter generation layer, first obtaining an initial channel filter F^{channel} by the matrix operation
F^{channel}_{i,j} = \theta(k_2^q)_i \cdot \phi(k_2^s)_j
wherein i and j are the channel indices of the query set and the support set respectively, and \theta and \phi are linear transformation functions that transform the corresponding features into sizes and shapes suitable for the matrix operation; then obtaining the channel filter feature f^{channel} by applying a softmax operation to the initial channel filter,
f^{channel}_{i,j} = \frac{\exp(F^{channel}_{i,j})}{\sum_{j}\exp(F^{channel}_{i,j})}
wherein the value at each channel index is calculated according to the softmax algorithm;
S2-4: filtering the support set first feature value v_1^s with the spatial filter feature f^{spatial} and the support set second feature value v_2^s with the channel filter feature f^{channel} to obtain a spatial cross-filtered feature map and a channel cross-filtered feature map respectively, and then splicing these with the query set first feature value v^q according to
\tilde{P} = \mathrm{concat}(v^q, f^{spatial} \cdot v_1^s, f^{channel} \cdot v_2^s)
to obtain the filtered feature map \tilde{P} corresponding to the context aggregation feature map of each level of the query set.
2. The small sample target detection method based on feature filtering and feature alignment as claimed in claim 1, wherein the features of the five obtained filtered feature maps are respectively input into the candidate region suggestion network; the candidate region suggestion network consists of a confidence convolution layer and a box regression convolution layer, both with 1×1 convolution kernels, wherein the number of output channels of the confidence convolution layer is the number of anchors and the number of output channels of the box regression convolution layer is 4 times the number of anchors; for each feature element point of each filtered feature map, three candidate boxes of different scales are predicted according to the transverse or longitudinal proportion of the target, i.e. the set number of anchors is 3.
3. The small sample target detection method based on feature filtering and feature alignment as claimed in claim 1, wherein the class probability corresponding to the candidate-box feature map and the fine regression of the box are obtained as follows:
S4-1: carrying out ROIAlign on each candidate box, namely obtaining feature points on the corresponding filtered feature map according to the sampled candidate box coordinates and performing interpolation with a bilinear interpolation algorithm to obtain a candidate-box feature map of size 7×7 with 256 channels for each candidate box;
S4-2: splicing the candidate-box feature map obtained from each ROIAlign with the support set class prototype set feature map, and applying a 1×1 convolution to the spliced feature map to generate an offset parameter map with 18 channels, wherein the parameters of the 18 channels represent the sampling offsets of a 3×3 deformable convolution kernel, each of whose nine sampling points has an offset in the x direction and the y direction, giving the 18 output offset parameters;
S4-3: after the deformable convolution kernel is obtained, using it to perform deformed sampling on the 2000 candidate-box feature maps to obtain an aligned and corrected candidate-box feature map for each candidate box, the size and channel count of which are consistent with those of the original candidate-box feature map;
S4-4: flattening each aligned and corrected candidate-box feature map into a feature vector of length 12544 and passing it through two fully connected layers, each using ReLU as its activation function, to finally obtain 2000 activated feature vectors of length 1024; passing these through a class fully connected layer and a box regression fully connected layer to obtain the class probability and the precise box regression values respectively, and finally obtaining, for each candidate box whose confidence exceeds a preset threshold, the precise box together with the corresponding target class and confidence.
CN202211228411.7A 2022-10-08 2022-10-08 Small sample target detection method based on feature filtering and feature alignment Active CN116091787B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211228411.7A CN116091787B (en) 2022-10-08 2022-10-08 Small sample target detection method based on feature filtering and feature alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211228411.7A CN116091787B (en) 2022-10-08 2022-10-08 Small sample target detection method based on feature filtering and feature alignment

Publications (2)

Publication Number Publication Date
CN116091787A CN116091787A (en) 2023-05-09
CN116091787B true CN116091787B (en) 2024-06-18

Family

ID=86206981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211228411.7A Active CN116091787B (en) 2022-10-08 2022-10-08 Small sample target detection method based on feature filtering and feature alignment

Country Status (1)

Country Link
CN (1) CN116091787B (en)

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2524399C1 (en) * 2013-05-13 2014-07-27 Открытое акционерное общество "Конструкторское бюро по радиоконтролю систем управления, навигации и связи" (ОАО "КБ "Связь") Method of detecting small-size mobile objects
CN110349135A (en) * 2019-06-27 2019-10-18 歌尔股份有限公司 Object detection method and device
CN111738110A (en) * 2020-06-10 2020-10-02 杭州电子科技大学 Remote sensing image vehicle target detection method based on multi-scale attention mechanism
CN112560876B (en) * 2021-02-23 2021-05-11 中国科学院自动化研究所 Single-stage small sample target detection method for decoupling measurement
CN112883936A (en) * 2021-04-08 2021-06-01 桂林电子科技大学 Method and system for detecting vehicle violation
CN114494373A (en) * 2022-01-19 2022-05-13 深圳市比一比网络科技有限公司 High-precision rail alignment method and system based on target detection and image registration
CN114818963B (en) * 2022-05-10 2023-05-09 电子科技大学 Small sample detection method based on cross-image feature fusion
CN114998760A (en) * 2022-05-30 2022-09-02 河北工业大学 Radar image ship detection network model and detection method based on domain adaptation
CN115019103A (en) * 2022-06-20 2022-09-06 杭州电子科技大学 Small sample target detection method based on coordinate attention group optimization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Qin Li; Yong Wang. A Novel Teacher-Assistance-Based Method to Detect and Handle Bad Training Demonstrations in Learning From Demonstration. IEEE Transactions on Cognitive and Developmental Systems, 2021, full text. *
Detection and segmentation of weak and small maritime targets using infrared night vision; Wang Yong; Optical Technique (光学技术); 2022-07-15; full text *

Also Published As

Publication number Publication date
CN116091787A (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN111523546B (en) Image semantic segmentation method, system and computer storage medium
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN109583483A (en) A kind of object detection method and system based on convolutional neural networks
CN114495029B (en) Traffic target detection method and system based on improved YOLOv4
CN112131959A (en) 2D human body posture estimation method based on multi-scale feature reinforcement
CN112381733B (en) Image recovery-oriented multi-scale neural network structure searching method and network application
CN111652273A (en) Deep learning-based RGB-D image classification method
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
CN115984701A (en) Multi-modal remote sensing image semantic segmentation method based on coding and decoding structure
CN112861727A (en) Real-time semantic segmentation method based on mixed depth separable convolution
CN115240079A (en) Multi-source remote sensing image depth feature fusion matching method
CN116468645A (en) Antagonistic hyperspectral multispectral remote sensing fusion method
CN114048845B (en) Point cloud repairing method and device, computer equipment and storage medium
CN115311502A (en) Remote sensing image small sample scene classification method based on multi-scale double-flow architecture
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
CN116092190A (en) Human body posture estimation method based on self-attention high-resolution network
CN110796182A (en) Bill classification method and system for small amount of samples
CN116091787B (en) Small sample target detection method based on feature filtering and feature alignment
CN115631513B (en) Transformer-based multi-scale pedestrian re-identification method
CN117011943A (en) Multi-scale self-attention mechanism-based decoupled 3D network action recognition method
CN116797640A (en) Depth and 3D key point estimation method for intelligent companion line inspection device
CN117011655A (en) Adaptive region selection feature fusion based method, target tracking method and system
CN113191367B (en) Semantic segmentation method based on dense scale dynamic network
CN115424012A (en) Lightweight image semantic segmentation method based on context information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant