CN113221987B - Small sample target detection method based on cross attention mechanism - Google Patents


Info

Publication number
CN113221987B
CN113221987B (application CN202110482786.5A)
Authority
CN
China
Prior art keywords
picture
detected
reference picture
feature map
sequence
Prior art date
Legal status
Active
Application number
CN202110482786.5A
Other languages
Chinese (zh)
Other versions
CN113221987A (en)
Inventor
王鹏 (Wang Peng)
林蔚东 (Lin Weidong)
邓玉岩 (Deng Yuyan)
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202110482786.5A
Publication of CN113221987A
Application granted
Publication of CN113221987B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/045 Neural networks; Combinations of networks
    • G06N 3/08 Neural networks; Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 2201/07 Indexing scheme; Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a small sample target detection method based on a cross attention mechanism. The method constructs a small sample target detection model that uses a ResNet-50 model as its backbone network: the picture to be detected and the reference picture are input into the shared backbone network to extract features, the features are expanded into two sequences, and the two sequences are input into cross attention modules for feature fusion and enhancement, yielding feature maps of the picture to be detected and of the reference picture. The feature map of the picture to be detected is input into an RPN network to generate target candidate regions, and the features of all target candidate regions are extracted. Finally, a global pooling operation is applied to the reference picture features output by the cross attention module and to the target candidate region features, the two feature vectors obtained by the pooling operation are merged, and the merged vector is fed into a classifier to obtain the classification result. The invention fuses the features of the picture to be detected and the reference picture more effectively, improves the accuracy of the small sample detection model, and has good transferability.

Description

Small sample target detection method based on cross attention mechanism
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a small sample target detection method.
Background
The target detection task has long been a frontier research hot spot in the field of computer vision and is also the basis of numerous higher-level vision tasks. Its main purpose is to identify target objects in images or video content and annotate them with boxes. In recent years, target detection methods based on deep learning have developed rapidly and found wide application in fields such as autonomous driving, security, and surveillance. Current deep-learning-based target detection methods generally require a large amount of manually labeled data for training, and a model can be deployed in actual detection applications only after its training is completed. This approach suffers from two major problems. First, acquiring manually labeled data is often costly, and obtaining large amounts of some kinds of data is often difficult or even impractical. Second, the trained model can only detect object classes present in the training data and cannot adaptively detect new classes of objects that were not seen during the training stage. These shortcomings have, to a large extent, limited the development of target detection algorithms and their practical deployment.
To address these problems, small sample target detection methods have emerged in recent years. A small sample target detection algorithm aims to let the target detection model learn stronger generalization from the existing large amount of labeled data, so that it can learn the characteristics of a new class from only a small number of labeled samples and thereby gain the ability to detect targets of that new class. Applying small sample target detection reduces the amount of labeled data required, lowers labor costs, avoids repeated training when a new class is added, and improves efficiency.
Compared with general target detection algorithms, the small sample target detection task is more challenging, mainly because the number of labeled samples of the new class is very small: with such data alone, the model can hardly learn the general distribution of the new class's features, and misclassification occurs frequently. The target detection task can be divided into two subtasks: first, finding all regions of the picture that contain objects and marking them with rectangular boxes; second, indicating the class of the object marked by each rectangular box. The first subtask is comparatively easy because, in general, the features of a target differ significantly from those of the background; the second, classification subtask is the dominant factor limiting the effectiveness of small sample target detection.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a small sample target detection method based on a cross attention mechanism. The method constructs a small sample target detection model that uses a ResNet-50 model as its backbone network: the picture to be detected and the reference picture are input into the shared backbone network to extract features, the features are expanded into two sequences, and the two sequences are input into cross attention modules for feature fusion and enhancement, yielding feature maps of the picture to be detected and of the reference picture. The feature map of the picture to be detected is input into an RPN network to generate target candidate regions, and the features of all target candidate regions are extracted. Finally, a global pooling operation is applied to the reference picture features output by the cross attention module and to the target candidate region features, the two feature vectors obtained by the pooling operation are merged, and the merged vector is fed into a classifier to obtain the classification result. The invention fuses the features of the picture to be detected and the reference picture more effectively, improves the accuracy of the small sample detection model, and has good transferability.
The technical scheme adopted by the invention for solving the technical problems comprises the following steps:
Step 1: constructing a small sample target detection model;
Step 1-1: respectively inputting a picture to be detected and a reference picture into a shared backbone network to extract features, and acquiring initial feature maps of the picture to be detected and the reference picture, of sizes H′t×W′t and H′q×W′q respectively;
the reference picture is an image block containing only a labeled target;
Step 1-2: the number of channels of the initial feature map of the picture to be detected is adjusted to 256 by adopting a 3×3 convolution, and the map is unfolded to form the initial feature map sequence of the picture to be detected, where each element in the sequence is a 256-dimensional vector corresponding to all channel-dimension information of one point on the initial feature map of the picture to be detected;
The number of channels of the initial feature map of the reference picture is adjusted to 256 by adopting a 1×1 convolution, and the map is unfolded to form the initial feature map sequence of the reference picture, where each element in the sequence is a 256-dimensional vector corresponding to all channel-dimension information of one point on the initial feature map of the reference picture;
Step 1-3: inputting an initial feature map sequence of the picture to be detected and an initial feature map sequence of the reference picture at the same time into an upper cross attention module for feature fusion and enhancement, wherein the output sequence of the upper cross attention module is the feature map of the picture to be detected and has the same shape as the initial feature map of the picture to be detected;
inputting an initial feature map sequence of a picture to be detected and an initial feature map sequence of a reference picture into a lower cross attention module at the same time for feature fusion and enhancement, wherein the output sequence of the lower cross attention module is a feature map of the reference picture, and the shape of the output sequence is the same as that of the initial feature map of the reference picture;
The upper cross attention module and the lower cross attention module have the same structure;
Step 1-4: inputting the feature map of the picture to be detected into an RPN network to generate target candidate regions p_1, p_2, …, p_n, extracting the features of all target candidate regions by using the ROI Align algorithm, and sending them to the detection head for fine adjustment of the candidate region boxes, represented as follows:

bbox_i = R_head(A_ROI(F_t, p_i)), i = 1, …, n

wherein bbox_i denotes the i-th target candidate region feature, R_head denotes the detection-head regressor network, A_ROI denotes the ROI Align operation, F_t is the feature map of the picture to be detected, and i = 1, …, n;
and simultaneously carrying out a global pooling operation on the reference picture feature map and the target candidate region features, merging the two feature vectors obtained by the global pooling operation, and then sending the result into a classifier to obtain the classification result:

P(bbox_i) = C_cls(Concat(GAP(F_q), GAP(bbox_i)))

wherein GAP(·) denotes the global pooling operation, Concat(·) denotes the merging operation, C_cls denotes the classifier, F_q denotes the reference picture feature map, and P(bbox_i) denotes the predicted class probability distribution of each target box;
Step 2: training the small sample target detection model;
Step 2-1: collecting pictures and preprocessing them, adjusting the picture sizes to fixed values and then randomly flipping them to generate the training set of pictures to be detected;
scanning the training set of pictures to be detected, finding the targets in each picture to be detected, and selecting and labeling them to form target boxes; dividing the labels of all target boxes by category to generate several label lists; for a given training picture, counting all its labels, randomly selecting one of the label categories, selecting a target box of that category from the corresponding label list as the reference picture, and resizing the reference picture to a specified size;
Step 2-2: optimizing the small sample target detection model by using the SGD algorithm, wherein the loss function used for training is consistent with Faster R-CNN; when training is completed, the final small sample target detection model is obtained.
Step 3: verifying the final small sample target detection model;
Giving a pair consisting of a picture to be detected and a reference picture, and marking all targets in the picture to be detected whose class is consistent with that of the reference picture, wherein the class of the reference picture has no related labeled data in the training stage, and only one labeled example per class is available in the testing stage.
Preferably, the upper and lower cross-attention modules are constructed as follows:
The upper cross attention module is constructed based on a Transformer and is specifically described as follows:
Let X_t ∈ R^(N_t×256) and X_q ∈ R^(N_q×256) denote the initial feature map sequence of the picture to be detected and the initial feature map sequence of the reference picture respectively, where N_t and N_q are the lengths of the two sequences; let Q = X_t, K = V = X_q, perform the multi-head attention operation, and output Y_t:

Y_t = Norm(X′_t + FFN(X′_t))
X′_t = Norm(X_t + P_t + MultiHead(X_t + P_t, X_q + P_q, X_q))
MultiHead(Q, K, V) = Concat(head_1, head_2, …, head_M) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

wherein P_t ∈ R^(N_t×256) and P_q ∈ R^(N_q×256) are the spatial position encodings of the sequences X_t and X_q, computed with sine functions; Norm denotes the layer normalization operation, which makes the training gradients more stable; Y_t denotes the output corresponding to the initial feature map sequence of the picture to be detected; Q, K, V denote input sequences whose elements all have the same dimension d_k; W_i^Q, W_i^K, W_i^V are the matrices used to compute the values of Q, K, V in different subspaces; W^O is the projection matrix; and M is the number of attention heads;
For the lower cross attention module, let Q = X_q, K = V = X_t and perform the same operations as in the upper cross attention module to obtain the output Y_q, which corresponds to the initial feature map sequence of the reference picture.
Preferably, the backbone network is a ResNet-50 model.
Preferably, M = 8 and d_m = 256.
The beneficial effects of the invention are as follows:
The method uses a cross attention mechanism to better fuse the features of the picture to be detected and the reference picture, improves the accuracy of the small sample detection model, and is transferable and plug-and-play.
Drawings
FIG. 1 is a schematic diagram of a small sample object detection model according to the present invention.
FIG. 2 is a cross-attention module visualization result of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
The invention designs a small sample target detection method based on a cross attention mechanism that enables the target detection model to adaptively learn higher-order meta-features: for a new category that does not appear in the training stage, all targets of that category in the picture to be detected are detected adaptively from only one labeled sample. A cross attention mechanism is used to adaptively learn the similarity between the features of the picture to be detected and those of the reference picture, which improves the final classification accuracy. With this method, the model can adaptively detect all targets of a category in a picture given only one label of that category, achieving a better detection effect.
The technical scheme of the invention consists mainly of two parts: the overall model structure and the design of the cross attention module. The overall model is built on the classical two-stage target detection model Faster R-CNN, and its main pipeline comprises feature extraction, feature fusion based on the cross attention mechanism, target candidate region generation and adjustment, and classification of the final target boxes. The cross attention module is constructed based on the Transformer; two groups of parallel Transformer encoders enhance the features of the picture to be detected and of the reference picture respectively.
As shown in fig. 1, a small sample target detection method based on a cross-attention mechanism includes the following steps:
Step 1: constructing a small sample target detection model;
Step 1-1: respectively inputting the picture to be detected (Target image in the figure) and the reference picture (Query image in the figure; the reference picture is an image block containing only a labeled target) into a shared backbone network to extract features, and acquiring the initial feature maps of the picture to be detected and of the reference picture, of sizes H′t×W′t and H′q×W′q respectively, wherein the backbone network is the ResNet-50 model of Faster R-CNN;
Step 1-2: the number of channels of the initial feature map of the picture to be detected is adjusted to 256 by adopting a 3×3 convolution, and the map is unfolded to form the initial feature map sequence of the picture to be detected, where each element in the sequence is a 256-dimensional vector corresponding to all channel-dimension information of one point on the initial feature map of the picture to be detected;
The number of channels of the initial feature map of the reference picture is adjusted to 256 by adopting a 1×1 convolution, and the map is unfolded to form the initial feature map sequence of the reference picture, where each element in the sequence is a 256-dimensional vector corresponding to all channel-dimension information of one point on the initial feature map of the reference picture;
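As an illustration of steps 1-1 and 1-2 above, the following PyTorch sketch shows how a shared ResNet-50 backbone, a 3×3 convolution, a 1×1 convolution, and a flattening step could produce the two 256-dimensional sequences; the module and variable names (SharedBackbone, target_proj, query_proj) and the choice of a C4 cut point are illustrative assumptions, not identifiers from the patent.

```python
import torch
import torch.nn as nn
import torchvision

class SharedBackbone(nn.Module):
    """Minimal sketch: shared feature extractor plus channel adjustment."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # keep the layers up to conv4, a common Faster R-CNN C4 backbone (1024 channels)
        self.body = nn.Sequential(*list(resnet.children())[:-3])
        self.target_proj = nn.Conv2d(1024, 256, kernel_size=3, padding=1)  # 3x3 conv, picture to be detected
        self.query_proj = nn.Conv2d(1024, 256, kernel_size=1)              # 1x1 conv, reference picture

    def forward(self, target_img, query_img):
        f_t = self.target_proj(self.body(target_img))  # (B, 256, H't, W't)
        f_q = self.query_proj(self.body(query_img))    # (B, 256, H'q, W'q)
        # unfold each map into a sequence: one 256-d vector per spatial location
        x_t = f_t.flatten(2).permute(0, 2, 1)          # (B, N_t, 256), N_t = H't * W't
        x_q = f_q.flatten(2).permute(0, 2, 1)          # (B, N_q, 256), N_q = H'q * W'q
        return x_t, x_q, f_t, f_q
```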
Step 1-3: inputting an initial feature map sequence of the picture to be detected and an initial feature map sequence of the reference picture at the same time into an upper cross attention module for feature fusion and enhancement, wherein the output sequence of the upper cross attention module is the feature map of the picture to be detected and has the same shape as the initial feature map of the picture to be detected;
inputting an initial feature map sequence of a picture to be detected and an initial feature map sequence of a reference picture into a lower cross attention module at the same time for feature fusion and enhancement, wherein the output sequence of the lower cross attention module is a feature map of the reference picture, and the shape of the output sequence is the same as that of the initial feature map of the reference picture;
The upper cross attention module and the lower cross attention module have the same structure;
Step 1-4: inputting the feature map of the picture to be detected into an RPN network to generate target candidate regions p_1, p_2, …, p_n, extracting the features of all target candidate regions by using the ROI Align algorithm, and sending them to the detection head for fine adjustment of the candidate region boxes, represented as follows:

bbox_i = R_head(A_ROI(F_t, p_i)), i = 1, …, n

wherein bbox_i denotes the i-th target candidate region feature, R_head denotes the detection-head regressor network, A_ROI denotes the ROI Align operation, F_t is the feature map of the picture to be detected, and i = 1, …, n;
In addition to adjusting the candidate region boxes, a classification result must be produced for every box. Unlike general target detection, the model ultimately performs only a single classification, because each detection pass detects only the targets in the picture that are consistent with the category of the reference picture, and all remaining targets are treated as background.
Considering that the classification task depends on the features of the reference picture, a global pooling operation is simultaneously applied to the reference picture feature map and the target candidate region features; the two feature vectors obtained by the global pooling operation are merged and sent into a classifier to obtain the classification result:

P(bbox_i) = C_cls(Concat(GAP(F_q), GAP(bbox_i)))

wherein GAP(·) denotes the global pooling operation, Concat(·) denotes the merging operation, C_cls denotes the classifier, F_q denotes the reference picture feature map, and P(bbox_i) denotes the predicted class probability distribution of each target box;
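The classification branch just described can be sketched as follows, assuming proposals from an RPN and torchvision's roi_align; the GAP, concat, classifier structure follows the description above, while the classifier widths, the 7×7 ROI size, and the 1/16 feature stride are assumptions (the RPN and the box regressor are taken from the standard Faster R-CNN pipeline and omitted here).

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class PairClassifier(nn.Module):
    """Sketch of the GAP -> Concat -> classifier head; binary output:
    'same class as the reference picture' vs. background."""
    def __init__(self, dim=256):
        super().__init__()
        self.cls = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, f_t, f_q, proposals):
        # f_t: (1, 256, H't, W't) target feature map; f_q: (1, 256, H'q, W'q) reference map
        # proposals: (n, 4) RPN boxes in (x1, y1, x2, y2) image coordinates
        box_feats = roi_align(f_t, [proposals], output_size=(7, 7),
                              spatial_scale=1.0 / 16)        # (n, 256, 7, 7)
        gap_box = box_feats.mean(dim=(2, 3))                 # GAP over each candidate region, (n, 256)
        gap_ref = f_q.mean(dim=(2, 3)).expand_as(gap_box)    # GAP over the reference map, broadcast
        logits = self.cls(torch.cat([gap_ref, gap_box], dim=1))
        return logits.softmax(dim=1)                         # P(bbox_i) per candidate box
```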
Step 1-5: the cross attention module is the core module of this model. It is constructed based on the Transformer, namely the Cross-Attention Transformer (CAT) module in fig. 1, and one layer of the cross attention module consists of two parallel Transformers. The attention mechanism of the Transformer is very well suited to learning relations among sequence elements and has very good long-distance modeling capability; its core operation can be expressed as

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

where Q, K, V denote the input sequences, whose elements all have the same dimension d_k. One major difference between the Transformer and the general attention mechanism is that it uses multi-head attention: the inputs are mapped into different subspaces, the attention operations are performed separately in each subspace, and finally all results are concatenated and projected to give the output, i.e.
MultiHead(Q, K, V) = Concat(head_1, head_2, …, head_M) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q, W_i^K, W_i^V are the matrices used to compute the values of Q, K, V in different subspaces, W^O is the projection matrix, and M is the number of attention heads. In this model, M = 8 and d_m = 256. After the multi-head attention operation, the output is sent into an FFN module formed by two fully connected layers to obtain the final result

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

with a ReLU activation function between the two layers to provide nonlinearity.
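The generic operations above can be written compactly as follows; this is a didactic re-implementation of the standard scaled-dot-product multi-head attention and FFN equations, not code from the patent, and the FFN hidden width of 1024 is an assumption.

```python
import math
import torch
import torch.nn as nn

def attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return scores.softmax(dim=-1) @ v

class MultiHead(nn.Module):
    def __init__(self, d_model=256, num_heads=8):  # M = 8, d_m = 256 as in the model
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d = num_heads, d_model // num_heads
        self.wq = nn.Linear(d_model, d_model)  # stacks W_i^Q for all heads
        self.wk = nn.Linear(d_model, d_model)  # stacks W_i^K
        self.wv = nn.Linear(d_model, d_model)  # stacks W_i^V
        self.wo = nn.Linear(d_model, d_model)  # output projection W^O

    def forward(self, q, k, v):
        B = q.size(0)
        split = lambda x: x.view(B, -1, self.h, self.d).transpose(1, 2)  # (B, M, N, d)
        out = attention(split(self.wq(q)), split(self.wk(k)), split(self.wv(v)))
        out = out.transpose(1, 2).reshape(B, -1, self.h * self.d)        # Concat(head_1..head_M)
        return self.wo(out)

class FFN(nn.Module):
    def __init__(self, d_model=256, d_hidden=1024):
        super().__init__()
        # FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, with ReLU between the two layers
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_model))

    def forward(self, x):
        return self.net(x)
```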
The above are the basic operations of a generic cross attention module. In one layer of the cross attention module of this model, let X_t ∈ R^(N_t×256) and X_q ∈ R^(N_q×256) denote the initial feature map sequence of the picture to be detected and the initial feature map sequence of the reference picture respectively, where N_t and N_q are the lengths of the two sequences; let Q = X_t, K = V = X_q, and perform the multi-head attention operation, i.e.

Y_t = Norm(X′_t + FFN(X′_t))
X′_t = Norm(X_t + P_t + MultiHead(X_t + P_t, X_q + P_q, X_q))

where P_t ∈ R^(N_t×256) and P_q ∈ R^(N_q×256) are the spatial position encodings of the sequences X_t and X_q, computed with sine functions, and Norm denotes the layer normalization operation, which makes the training gradients more stable; Y_t denotes the output corresponding to the features of the picture to be detected. For the lower cross attention module, let Q = X_q, K = V = X_t and perform the same operations to obtain Y_q, which is the output corresponding to the initial feature map sequence of the reference picture. All the operations above constitute one layer of the cross attention module; in practice, N identical layers are stacked to further enhance both sets of features. The Y_t and Y_q output after the N layers are the feature sequences fused and enhanced by the cross attention mechanism; they are reshaped back into feature maps and then used for the subsequent target detection regression and classification tasks.
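Putting the pieces together, one CAT layer as described above might look like the sketch below, reusing MultiHead and FFN from the previous snippet. The 1-D sinusoidal position encoding is a simplifying assumption (the patent computes a sine-based spatial encoding over the feature map), and adding the encoding to both queries and keys follows the equations as reconstructed above.

```python
import math
import torch
import torch.nn as nn

def sine_position_encoding(n, d_model=256):
    # standard sinusoidal encoding over sequence positions (simplified to 1-D)
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(n, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class CATLayer(nn.Module):
    """One cross-attention layer: two parallel branches, each attending
    from one sequence to the other."""
    def __init__(self, d_model=256, num_heads=8):
        super().__init__()
        self.attn_t = MultiHead(d_model, num_heads)  # upper branch: Q = X_t, K = V = X_q
        self.attn_q = MultiHead(d_model, num_heads)  # lower branch: Q = X_q, K = V = X_t
        self.ffn_t, self.ffn_q = FFN(d_model), FFN(d_model)
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, x_t, x_q):
        p_t = sine_position_encoding(x_t.size(1)).to(x_t.device)
        p_q = sine_position_encoding(x_q.size(1)).to(x_q.device)
        # X'_t = Norm(X_t + P_t + MultiHead(X_t + P_t, X_q + P_q, X_q)); Y_t = Norm(X'_t + FFN(X'_t))
        h_t = self.norms[0](x_t + p_t + self.attn_t(x_t + p_t, x_q + p_q, x_q))
        y_t = self.norms[1](h_t + self.ffn_t(h_t))
        # symmetric lower branch for the reference sequence
        h_q = self.norms[2](x_q + p_q + self.attn_q(x_q + p_q, x_t + p_t, x_t))
        y_q = self.norms[3](h_q + self.ffn_q(h_q))
        return y_t, y_q
```

Stacking is then a matter of iterating N such layers, e.g. `layers = nn.ModuleList([CATLayer() for _ in range(N)])`, and reshaping Y_t and Y_q back into feature maps afterwards.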
Step 2: training a small sample target detection model;
Step 2-1: collecting pictures and preprocessing them, adjusting the picture sizes to fixed values and then randomly flipping them to generate the training set of pictures to be detected;
scanning the training set of pictures to be detected, finding the targets in each picture to be detected, and selecting and labeling them to form target boxes; dividing the labels of all target boxes by category to generate several label lists; for a given training picture, counting all its labels, randomly selecting one of the label categories, selecting a target box of that category from the corresponding label list as the reference picture, and resizing the reference picture to a specified size;
Step 2-2: optimizing the small sample target detection model by using the SGD algorithm, wherein the loss function used for training is consistent with Faster R-CNN; when training is completed, the final small sample target detection model is obtained.
Step 3: verifying the final small sample target detection model;
Giving a pair consisting of a picture to be detected and a reference picture, and marking all targets in the picture to be detected whose class is consistent with that of the reference picture, wherein the class of the reference picture has no related labeled data in the training stage, and only one labeled example per class is available in the testing stage.
The feature sequences input to the cross attention module correspond to the features of divided regions of the original pictures, and the cross attention operation strengthens the regions whose features are similar between the picture to be detected and the reference picture, so the model can adaptively perceive the regions of the picture to be detected that may belong to the same category as the reference picture; the visualization results confirm this view. As shown in fig. 2, the first column shows the reference picture (Query image), the second column the picture to be detected (Target image), the third column the visualization of the feature map output after the picture to be detected passes through the backbone network, and the fourth column the visualization of the feature map after the cross attention module. It is easy to see that after the cross attention module, the response of the regions of the same category as the reference picture is clearly higher than that of other regions on the feature map of the picture to be detected.
Specific examples:
The invention provides a small sample target detection method based on a cross attention mechanism. The overall flow is divided into two parts, a network training stage and a network testing stage, and the whole method is an improvement on the two-stage target detection framework Faster R-CNN.
1. And (3) a network training stage:
The training stage first preprocesses the pictures. Given a picture for training, its size is adjusted by bilinear interpolation: the shorter side is resized to 600 while keeping the aspect ratio, with the longer side not exceeding 1000, and during training the picture is randomly flipped with probability 0.5. Next, the reference picture is generated: the whole data set is scanned, and the labels of all target boxes in it are divided by category to generate several label lists. For a specific training picture, all of its ground-truth labels are counted and one category is randomly selected; a picture is then selected from the previously generated label list of that category and cropped according to the position and shape of the label box to serve as the reference picture, and the input reference pictures are uniformly resized to 128×128. In the formal training, the model is optimized using the SGD algorithm, the initial learning rate is set to 0.01, the batch size is set to 16, and the learning rate is decayed to one tenth of its value at the 5th and 9th epochs. The loss function used for training is consistent with Faster R-CNN, and the model obtained after training is used for testing.
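The preprocessing and optimization schedule described above could be set up as follows; this is a sketch under the stated hyperparameters, with the dataset, model, and loss computation as placeholders (the Faster R-CNN losses are assumed to come from whatever detection framework is used).

```python
import random
import torch
import torch.nn.functional as F

def resize_keep_ratio(img, short=600, long_max=1000):
    # bilinear resize: shorter side -> 600, keeping aspect ratio, longer side <= 1000
    h, w = img.shape[-2:]
    scale = min(short / min(h, w), long_max / max(h, w))
    size = (round(h * scale), round(w * scale))
    return F.interpolate(img[None], size=size, mode="bilinear", align_corners=False)[0]

def preprocess(target_img, query_crop, training=True):
    target_img = resize_keep_ratio(target_img)
    if training and random.random() < 0.5:              # random horizontal flip with p = 0.5
        target_img = torch.flip(target_img, dims=[-1])  # (ground-truth boxes must be flipped too)
    query_crop = F.interpolate(query_crop[None], size=(128, 128),
                               mode="bilinear", align_corners=False)[0]  # reference crop -> 128x128
    return target_img, query_crop

# SGD with lr 0.01, batch size 16, and x0.1 decay at the 5th and 9th epochs:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5, 9], gamma=0.1)
```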
2. Network testing stage:
The testing stage is used to finally verify the validity of the designed cross attention structure and the performance of the overall model. The evaluation setting of this part is as follows: given an arbitrary picture pair (a picture to be detected and a reference picture), the model must mark all targets in the picture to be detected that are consistent with the category of the reference picture, where the category of the reference picture has no related labeled data in the training stage and only one labeled example of that category is available in the testing stage. The resizing of input pictures in the testing stage is consistent with the training stage.
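For the evaluation protocol above, inference over one picture pair reduces to a few lines; `model` and `preprocess` refer to the sketches earlier in this description, and the 0.5 score threshold and the returned (boxes, probs) interface are assumptions for illustration.

```python
import torch

@torch.no_grad()
def detect_one_shot(model, target_img, query_crop, score_thresh=0.5):
    model.eval()
    target_img, query_crop = preprocess(target_img, query_crop, training=False)  # same resizing as training
    boxes, probs = model(target_img[None], query_crop[None])  # (n, 4) boxes, (n, 2) class probabilities
    keep = probs[:, 1] > score_thresh                         # column 1: same class as the reference
    return boxes[keep], probs[keep, 1]
```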

Claims (3)

1. A method for detecting a small sample target based on a cross-attention mechanism, comprising the steps of:
Step 1: constructing a small sample target detection model;
Step 1-1: respectively inputting a picture to be detected and a reference picture into a shared backbone network to extract features, and acquiring initial feature maps of the picture to be detected and the reference picture, of sizes H′t×W′t and H′q×W′q respectively;
the reference picture is an image block containing only a labeled target;
Step 1-2: the number of channels of the initial feature map of the picture to be detected is adjusted to 256 by adopting a 3×3 convolution, and the map is unfolded to form the initial feature map sequence of the picture to be detected, where each element in the sequence is a 256-dimensional vector corresponding to all channel-dimension information of one point on the initial feature map of the picture to be detected;
The number of channels of the initial feature map of the reference picture is adjusted to 256 by adopting a 1×1 convolution, and the map is unfolded to form the initial feature map sequence of the reference picture, where each element in the sequence is a 256-dimensional vector corresponding to all channel-dimension information of one point on the initial feature map of the reference picture;
Step 1-3: inputting an initial feature map sequence of the picture to be detected and an initial feature map sequence of the reference picture at the same time into an upper cross attention module for feature fusion and enhancement, wherein the output sequence of the upper cross attention module is the feature map of the picture to be detected and has the same shape as the initial feature map of the picture to be detected;
inputting an initial feature map sequence of a picture to be detected and an initial feature map sequence of a reference picture into a lower cross attention module at the same time for feature fusion and enhancement, wherein the output sequence of the lower cross attention module is a feature map of the reference picture, and the shape of the output sequence is the same as that of the initial feature map of the reference picture;
The upper cross attention module and the lower cross attention module have the same structure;
The upper and lower cross attention modules are constructed as follows:
The upper cross attention module is constructed based on a Transformer and is specifically described as follows:
Let X_t ∈ R^(N_t×256) and X_q ∈ R^(N_q×256) denote the initial feature map sequence of the picture to be detected and the initial feature map sequence of the reference picture respectively, where N_t and N_q are the lengths of the two sequences; let Q = X_t, K = V = X_q, perform the multi-head attention operation, and output Y_t:

Y_t = Norm(X′_t + FFN(X′_t))
X′_t = Norm(X_t + P_t + MultiHead(X_t + P_t, X_q + P_q, X_q))
MultiHead(Q, K, V) = Concat(head_1, head_2, …, head_M) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

wherein P_t ∈ R^(N_t×256) and P_q ∈ R^(N_q×256) are the spatial position encodings of the sequences X_t and X_q, computed with sine functions; Norm denotes the layer normalization operation, which makes the training gradients more stable; Y_t denotes the output corresponding to the initial feature map sequence of the picture to be detected; Q, K, V denote input sequences whose elements all have the same dimension d_k; W_i^Q, W_i^K, W_i^V are the matrices used to compute the values of Q, K, V in different subspaces; W^O is the projection matrix; and M is the number of attention heads;
For the lower cross attention module, let Q = X_q, K = V = X_t and perform the same operations as in the upper cross attention module to obtain the output Y_q, which corresponds to the initial feature map sequence of the reference picture;
Step 1-4: inputting the feature map of the picture to be detected into an RPN network to generate target candidate regions p_1, p_2, …, p_n, extracting the features of all target candidate regions by using the ROI Align algorithm, and sending them to the detection head for fine adjustment of the candidate region boxes, represented as follows:

bbox_i = R_head(A_ROI(F_t, p_i)), i = 1, …, n

wherein bbox_i denotes the i-th target candidate region feature, R_head denotes the detection-head regressor network, A_ROI denotes the ROI Align operation, F_t is the feature map of the picture to be detected, and i = 1, …, n;
and simultaneously carrying out a global pooling operation on the reference picture feature map and the target candidate region features, merging the two feature vectors obtained by the global pooling operation, and then sending the result into a classifier to obtain the classification result:

P(bbox_i) = C_cls(Concat(GAP(F_q), GAP(bbox_i)))

wherein GAP(·) denotes the global pooling operation, Concat(·) denotes the merging operation, C_cls denotes the classifier, F_q denotes the reference picture feature map, and P(bbox_i) denotes the predicted class probability distribution of each target box;
Step 2: training the small sample target detection model;
Step 2-1: collecting pictures and preprocessing them, adjusting the picture sizes to fixed values and then randomly flipping them to generate the training set of pictures to be detected;
scanning the training set of pictures to be detected, finding the targets in each picture to be detected, and selecting and labeling them to form target boxes; dividing the labels of all target boxes by category to generate several label lists; for a given training picture, counting all its labels, randomly selecting one of the label categories, selecting a target box of that category from the corresponding label list as the reference picture, and resizing the reference picture to a specified size;
Step 2-2: optimizing the small sample target detection model by using the SGD algorithm, wherein the loss function used for training is consistent with Faster R-CNN, and completing training to obtain the final small sample target detection model;
Step 3: verifying the final small sample target detection model;
Giving a pair consisting of a picture to be detected and a reference picture, and marking all targets in the picture to be detected whose class is consistent with that of the reference picture, wherein the class of the reference picture has no related labeled data in the training stage, and only one labeled example per class is available in the testing stage.
2. The method of claim 1, wherein the backbone network is a ResNet-50 model.
3. The method for small sample object detection based on cross-attention mechanism of claim 1, wherein M = 8 and d_m = 256.
CN202110482786.5A 2021-04-30 2021-04-30 Small sample target detection method based on cross attention mechanism Active CN113221987B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110482786.5A CN113221987B (en) 2021-04-30 2021-04-30 Small sample target detection method based on cross attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110482786.5A CN113221987B (en) 2021-04-30 2021-04-30 Small sample target detection method based on cross attention mechanism

Publications (2)

Publication Number Publication Date
CN113221987A (en) 2021-08-06
CN113221987B (en) 2024-06-28

Family

ID=77090554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110482786.5A Active CN113221987B (en) 2021-04-30 2021-04-30 Small sample target detection method based on cross attention mechanism

Country Status (1)

Country Link
CN (1) CN113221987B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822368B (en) * 2021-09-29 2023-06-20 成都信息工程大学 Anchor-free incremental target detection method
CN114663431B (en) * 2022-05-19 2022-08-30 浙江大学 Pancreatic tumor image segmentation method and system based on reinforcement learning and attention
CN116091787B (en) * 2022-10-08 2024-06-18 中南大学 Small sample target detection method based on feature filtering and feature alignment
CN116205916B (en) * 2023-04-28 2023-09-15 南方电网数字电网研究院有限公司 Method and device for detecting defects of small electric power sample, computer equipment and storage medium
CN116597384B (en) * 2023-06-02 2024-03-05 中国人民解放军国防科技大学 Space target identification method and device based on small sample training and computer equipment
CN117953206A (en) * 2024-03-25 2024-04-30 厦门大学 Mixed supervision target detection method and device based on point labeling guidance

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107515895B (en) * 2017-07-14 2020-06-05 中国科学院计算技术研究所 Visual target retrieval method and system based on target detection
CN110211097B (en) * 2019-05-14 2021-06-08 河海大学 Crack image detection method based on fast R-CNN parameter migration
CN111259940B (en) * 2020-01-10 2023-04-07 杭州电子科技大学 Target detection method based on space attention map
US10970598B1 (en) * 2020-05-13 2021-04-06 StradVision, Inc. Learning method and learning device for training an object detection network by using attention maps and testing method and testing device using the same
CN112150504A (en) * 2020-08-03 2020-12-29 上海大学 Visual tracking method based on attention mechanism
CN112488999B (en) * 2020-11-19 2024-04-05 特斯联科技集团有限公司 Small target detection method, small target detection system, storage medium and terminal

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Application of a deep neural network fusing dual attention to UAV target detection (融合双注意力的深度神经网络在无人机目标检测中的应用); Zhan Zheqi; Chen Peng; Sang Yongsheng; Peng Dezhong; Modern Computer (现代计算机); 2020-04-15 (No. 11); full text *
A survey of visual single-target tracking algorithms (视觉单目标跟踪算法综述); Tang Yiming; Liu Yufei; Huang Hong; Measurement & Control Technology (测控技术); 2020-08-18 (No. 08); full text *

Also Published As

Publication number Publication date
CN113221987A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
CN113221987B (en) Small sample target detection method based on cross attention mechanism
CN111612807A (en) Small target image segmentation method based on scale and edge information
CN107784288A (en) A kind of iteration positioning formula method for detecting human face based on deep neural network
CN111652273B (en) Deep learning-based RGB-D image classification method
Huang et al. Qualitynet: Segmentation quality evaluation with deep convolutional networks
CN113112416B (en) Semantic-guided face image restoration method
CN113313149B (en) Dish identification method based on attention mechanism and metric learning
CN112991364A (en) Road scene semantic segmentation method based on convolution neural network cross-modal fusion
CN114626476A (en) Bird fine-grained image recognition method and device based on Transformer and component feature fusion
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN111222530A (en) Fine-grained image classification method, system, device and storage medium
Avola et al. Real-time deep learning method for automated detection and localization of structural defects in manufactured products
CN110851627B (en) Method for describing sun black subgroup in full-sun image
CN115410059A (en) Remote sensing image part supervision change detection method and device based on contrast loss
Shit et al. An encoder‐decoder based CNN architecture using end to end dehaze and detection network for proper image visualization and detection
CN113486715A (en) Image reproduction identification method, intelligent terminal and computer storage medium
CN116503398B (en) Insulator pollution flashover detection method and device, electronic equipment and storage medium
CN116486228A (en) Paper medicine box steel seal character recognition method based on improved YOLOV5 model
CN115861284A (en) Printed product linear defect detection method, printed product linear defect detection device, electronic equipment and storage medium
Zheng et al. Visual chirality meets freehand sketches
CN115205877A (en) Irregular typesetting invoice document layout prediction method and device and storage medium
Dong et al. Intelligent pixel-level pavement marking detection using 2D laser pavement images
CN114897901B (en) Battery quality detection method and device based on sample expansion and electronic equipment
CN111340111B (en) Method for recognizing face image set based on wavelet kernel extreme learning machine
CN117593514B (en) Image target detection method and system based on deep principal component analysis assistance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant