CN113221987B - Small sample target detection method based on cross attention mechanism - Google Patents


Info

Publication number
CN113221987B
CN113221987B (application CN202110482786.5A)
Authority
CN
China
Prior art keywords
picture
detected
reference picture
feature map
sequence
Prior art date
Legal status
Active
Application number
CN202110482786.5A
Other languages
Chinese (zh)
Other versions
CN113221987A (en)
Inventor
王鹏 (Wang Peng)
林蔚东 (Lin Weidong)
邓玉岩 (Deng Yuyan)
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202110482786.5A
Publication of CN113221987A
Application granted
Publication of CN113221987B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/045 Neural networks; Combinations of networks
    • G06N 3/08 Neural networks; Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 2201/07 Indexing scheme; Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a small sample target detection method based on a cross attention mechanism. The method constructs a small sample target detection model that uses a ResNet-50 model as its backbone network: the picture to be detected and the reference picture are input into the shared backbone network to extract features, the features are expanded into two sequences, and the two sequences are input into cross attention modules for feature fusion and enhancement, yielding feature maps of the picture to be detected and of the reference picture. The feature map of the picture to be detected is input into an RPN network to generate target candidate regions, and the features of all target candidate regions are extracted. Finally, a global pooling operation is applied to the reference picture features output by the cross attention module and to the target candidate region features, the two feature vectors obtained by the pooling operation are merged, and the merged vector is fed into a classifier to obtain the classification result. The invention fuses the features of the picture to be detected and the reference picture more effectively, improves the accuracy of the small sample detection model, and has good transferability.

Description

Small sample target detection method based on cross attention mechanism
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a small sample target detection method.
Background
The target detection task has long been a frontier research hot spot in the field of computer vision and is also the basis of numerous higher-level vision tasks. Its main purpose is to identify target objects in images or video content and annotate them with boxes. In recent years, target detection methods based on deep learning have developed rapidly and found wide application in fields such as autonomous driving, security, and surveillance. Current deep-learning-based target detection methods generally require a large amount of manually labeled data for training, and a model can be deployed in actual detection applications only after its training is completed. This approach suffers from two major problems. First, acquiring manually labeled data is often costly, and obtaining large amounts of some kinds of data is often difficult or even impractical. Second, the trained model can only detect object classes present in the training data and cannot adaptively detect new classes of objects that were not seen during the training stage. These shortcomings have, to a large extent, limited the development of target detection algorithms and their practical deployment.
To address these problems, small sample target detection methods have emerged in recent years. A small sample target detection algorithm aims to let the target detection model learn stronger generalization from the existing large amount of labeled data, so that it can learn the characteristics of a new class from only a small number of labeled samples and thereby gain the ability to detect targets of that new class. Applying small sample target detection reduces the amount of labeled data required, lowers labor costs, avoids repeated training when a new class is added, and improves efficiency.
Compared with general target detection algorithms, the small sample target detection task is more challenging, mainly because the number of labeled samples of the new class is very small: with such data alone, the model can hardly learn the general distribution of the new class's features, and misclassification occurs frequently. The target detection task can be divided into two subtasks: first, finding all regions of the picture that contain objects and marking them with rectangular boxes; second, indicating the class of the object marked by each rectangular box. The first subtask is comparatively easy because, in general, the features of a target differ significantly from those of the background; the second, classification subtask is the dominant factor limiting the effectiveness of small sample target detection.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a small sample target detection method based on a cross attention mechanism. The method constructs a small sample target detection model that uses a ResNet-50 model as its backbone network: the picture to be detected and the reference picture are input into the shared backbone network to extract features, the features are expanded into two sequences, and the two sequences are input into cross attention modules for feature fusion and enhancement, yielding feature maps of the picture to be detected and of the reference picture. The feature map of the picture to be detected is input into an RPN network to generate target candidate regions, and the features of all target candidate regions are extracted. Finally, a global pooling operation is applied to the reference picture features output by the cross attention module and to the target candidate region features, the two feature vectors obtained by the pooling operation are merged, and the merged vector is fed into a classifier to obtain the classification result. The invention fuses the features of the picture to be detected and the reference picture more effectively, improves the accuracy of the small sample detection model, and has good transferability.
The technical scheme adopted by the invention for solving the technical problems comprises the following steps:
Step 1: constructing a small sample target detection model;
Step 1-1: respectively inputting a picture to be detected and a reference picture into a shared backbone network to extract features, and acquiring initial feature maps of the picture to be detected and the reference picture, of sizes H′t×W′t and H′q×W′q respectively;
the reference picture is an image block containing only a labeled target;
Step 1-2: the number of channels of the initial feature map of the picture to be detected is adjusted to 256 by adopting a 3×3 convolution, and the map is unfolded to form the initial feature map sequence of the picture to be detected, where each element in the sequence is a 256-dimensional vector corresponding to all channel-dimension information of one point on the initial feature map of the picture to be detected;
The number of channels of the initial feature map of the reference picture is adjusted to 256 by adopting a 1×1 convolution, and the map is unfolded to form the initial feature map sequence of the reference picture, where each element in the sequence is a 256-dimensional vector corresponding to all channel-dimension information of one point on the initial feature map of the reference picture;
Step 1-3: inputting an initial feature map sequence of the picture to be detected and an initial feature map sequence of the reference picture at the same time into an upper cross attention module for feature fusion and enhancement, wherein the output sequence of the upper cross attention module is the feature map of the picture to be detected and has the same shape as the initial feature map of the picture to be detected;
inputting an initial feature map sequence of a picture to be detected and an initial feature map sequence of a reference picture into a lower cross attention module at the same time for feature fusion and enhancement, wherein the output sequence of the lower cross attention module is a feature map of the reference picture, and the shape of the output sequence is the same as that of the initial feature map of the reference picture;
The upper cross attention module and the lower cross attention module have the same structure;
Step 1-4: inputting the feature map of the picture to be detected into an RPN network to generate target candidate regions p_1, p_2, …, p_n, extracting the features of all target candidate regions by using the ROI Align algorithm, and sending them to the detection head for fine adjustment of the candidate region boxes, represented as follows:

bbox_i = R_head(A_ROI(F_t, p_i)), i = 1, …, n

wherein bbox_i denotes the i-th target candidate region feature, R_head denotes the detection-head regressor network, A_ROI denotes the ROI Align operation, F_t is the feature map of the picture to be detected, and i = 1, …, n;
and simultaneously carrying out a global pooling operation on the reference picture feature map and the target candidate region features, merging the two feature vectors obtained by the global pooling operation, and then sending the result into a classifier to obtain the classification result:

P(bbox_i) = C_cls(Concat(GAP(F_q), GAP(bbox_i)))

wherein GAP(·) denotes the global pooling operation, Concat(·) denotes the merging operation, C_cls denotes the classifier, F_q denotes the reference picture feature map, and P(bbox_i) denotes the predicted class probability distribution of each target box;
Step 2: training the small sample target detection model;
Step 2-1: collecting pictures and preprocessing them, adjusting the picture sizes to fixed values and then randomly flipping them to generate the training set of pictures to be detected;
scanning the training set of pictures to be detected, finding the targets in each picture to be detected, and selecting and labeling them to form target boxes; dividing the labels of all target boxes by category to generate several label lists; for a given training picture, counting all its labels, randomly selecting one of the label categories, selecting a target box of that category from the corresponding label list as the reference picture, and resizing the reference picture to a specified size;
Step 2-2: optimizing the small sample target detection model by using the SGD algorithm, wherein the loss function used for training is consistent with Faster R-CNN; when training is completed, the final small sample target detection model is obtained.
Step 3: verifying the final small sample target detection model;
Giving a pair consisting of a picture to be detected and a reference picture, and marking all targets in the picture to be detected whose class is consistent with that of the reference picture, wherein the class of the reference picture has no related labeled data in the training stage, and only one labeled example per class is available in the testing stage.
Preferably, the upper and lower cross-attention modules are constructed as follows:
The upper cross attention module is constructed based on a Transformer and is specifically described as follows:
Let X_t ∈ R^(N_t×256) and X_q ∈ R^(N_q×256) denote the initial feature map sequence of the picture to be detected and the initial feature map sequence of the reference picture respectively, where N_t and N_q are the lengths of the two sequences; let Q = X_t, K = V = X_q, perform the multi-head attention operation, and output Y_t:

Y_t = Norm(X′_t + FFN(X′_t))
X′_t = Norm(X_t + P_t + MultiHead(X_t + P_t, X_q + P_q, X_q))
MultiHead(Q, K, V) = Concat(head_1, head_2, …, head_M) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

wherein P_t ∈ R^(N_t×256) and P_q ∈ R^(N_q×256) are the spatial position encodings of the sequences X_t and X_q, computed with sine functions; Norm denotes the layer normalization operation, which makes the training gradients more stable; Y_t denotes the output corresponding to the initial feature map sequence of the picture to be detected; Q, K, V denote input sequences whose elements all have the same dimension d_k; W_i^Q, W_i^K, W_i^V are the matrices used to compute the values of Q, K, V in different subspaces; W^O is the projection matrix; and M is the number of attention heads;
For the lower cross attention module, let Q = X_q, K = V = X_t and perform the same operations as in the upper cross attention module to obtain the output Y_q, which corresponds to the initial feature map sequence of the reference picture.
Preferably, the backbone network is a ResNet-50 model.
Preferably, M = 8 and d_m = 256.
The beneficial effects of the invention are as follows:
The method uses a cross attention mechanism to better fuse the features of the picture to be detected and the reference picture, improves the accuracy of the small sample detection model, and is transferable and plug-and-play.
Drawings
FIG. 1 is a schematic diagram of a small sample object detection model according to the present invention.
FIG. 2 is a cross-attention module visualization result of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
The invention designs a small sample target detection method based on a cross attention mechanism that enables the target detection model to adaptively learn higher-order meta-features: for a new category that does not appear in the training stage, all targets of that category in the picture to be detected are detected adaptively from only one labeled sample. A cross attention mechanism is used to adaptively learn the similarity between the features of the picture to be detected and those of the reference picture, which improves the final classification accuracy. With this method, the model can adaptively detect all targets of a category in a picture given only one label of that category, achieving a better detection effect.
The technical scheme of the invention consists mainly of two parts: the overall model structure and the design of the cross attention module. The overall model is built on the classical two-stage target detection model Faster R-CNN, and its main pipeline comprises feature extraction, feature fusion based on the cross attention mechanism, target candidate region generation and adjustment, and classification of the final target boxes. The cross attention module is constructed based on the Transformer; two groups of parallel Transformer encoders enhance the features of the picture to be detected and of the reference picture respectively.
As shown in fig. 1, a small sample target detection method based on a cross-attention mechanism includes the following steps:
Step 1: constructing a small sample target detection model;
Step 1-1: respectively inputting the picture to be detected (Target image in the figure) and the reference picture (Query image in the figure; the reference picture is an image block containing only a labeled target) into a shared backbone network to extract features, and acquiring the initial feature maps of the picture to be detected and of the reference picture, of sizes H′t×W′t and H′q×W′q respectively, wherein the backbone network is the ResNet-50 model of Faster R-CNN;
Step 1-2: the number of channels of the initial feature map of the picture to be detected is adjusted to 256 by adopting a 3×3 convolution, and the map is unfolded to form the initial feature map sequence of the picture to be detected, where each element in the sequence is a 256-dimensional vector corresponding to all channel-dimension information of one point on the initial feature map of the picture to be detected;
The number of channels of the initial feature map of the reference picture is adjusted to 256 by adopting a 1×1 convolution, and the map is unfolded to form the initial feature map sequence of the reference picture, where each element in the sequence is a 256-dimensional vector corresponding to all channel-dimension information of one point on the initial feature map of the reference picture;
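As an illustration of steps 1-1 and 1-2 above, the following PyTorch sketch shows how a shared ResNet-50 backbone, a 3×3 convolution, a 1×1 convolution, and a flattening step could produce the two 256-dimensional sequences; the module and variable names (SharedBackbone, target_proj, query_proj) and the choice of a C4 cut point are illustrative assumptions, not identifiers from the patent.

```python
import torch
import torch.nn as nn
import torchvision

class SharedBackbone(nn.Module):
    """Minimal sketch: shared feature extractor plus channel adjustment."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # keep the layers up to conv4, a common Faster R-CNN C4 backbone (1024 channels)
        self.body = nn.Sequential(*list(resnet.children())[:-3])
        self.target_proj = nn.Conv2d(1024, 256, kernel_size=3, padding=1)  # 3x3 conv, picture to be detected
        self.query_proj = nn.Conv2d(1024, 256, kernel_size=1)              # 1x1 conv, reference picture

    def forward(self, target_img, query_img):
        f_t = self.target_proj(self.body(target_img))  # (B, 256, H't, W't)
        f_q = self.query_proj(self.body(query_img))    # (B, 256, H'q, W'q)
        # unfold each map into a sequence: one 256-d vector per spatial location
        x_t = f_t.flatten(2).permute(0, 2, 1)          # (B, N_t, 256), N_t = H't * W't
        x_q = f_q.flatten(2).permute(0, 2, 1)          # (B, N_q, 256), N_q = H'q * W'q
        return x_t, x_q, f_t, f_q
```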
Step 1-3: inputting an initial feature map sequence of the picture to be detected and an initial feature map sequence of the reference picture at the same time into an upper cross attention module for feature fusion and enhancement, wherein the output sequence of the upper cross attention module is the feature map of the picture to be detected and has the same shape as the initial feature map of the picture to be detected;
inputting an initial feature map sequence of a picture to be detected and an initial feature map sequence of a reference picture into a lower cross attention module at the same time for feature fusion and enhancement, wherein the output sequence of the lower cross attention module is a feature map of the reference picture, and the shape of the output sequence is the same as that of the initial feature map of the reference picture;
The upper cross attention module and the lower cross attention module have the same structure;
Step 1-4: inputting the feature map of the picture to be detected into an RPN network to generate target candidate regions p_1, p_2, …, p_n, extracting the features of all target candidate regions by using the ROI Align algorithm, and sending them to the detection head for fine adjustment of the candidate region boxes, represented as follows:

bbox_i = R_head(A_ROI(F_t, p_i)), i = 1, …, n

wherein bbox_i denotes the i-th target candidate region feature, R_head denotes the detection-head regressor network, A_ROI denotes the ROI Align operation, F_t is the feature map of the picture to be detected, and i = 1, …, n;
In addition to adjusting the candidate region boxes, a classification result must be produced for every box. Unlike general target detection, the model ultimately performs only a single classification, because each detection pass detects only the targets in the picture that are consistent with the category of the reference picture, and all remaining targets are treated as background.
Considering that the classification task depends on the features of the reference picture, a global pooling operation is simultaneously applied to the reference picture feature map and the target candidate region features; the two feature vectors obtained by the global pooling operation are merged and sent into a classifier to obtain the classification result:

P(bbox_i) = C_cls(Concat(GAP(F_q), GAP(bbox_i)))

wherein GAP(·) denotes the global pooling operation, Concat(·) denotes the merging operation, C_cls denotes the classifier, F_q denotes the reference picture feature map, and P(bbox_i) denotes the predicted class probability distribution of each target box;
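The classification branch just described can be sketched as follows, assuming proposals from an RPN and torchvision's roi_align; the GAP, concat, classifier structure follows the description above, while the classifier widths, the 7×7 ROI size, and the 1/16 feature stride are assumptions (the RPN and the box regressor are taken from the standard Faster R-CNN pipeline and omitted here).

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class PairClassifier(nn.Module):
    """Sketch of the GAP -> Concat -> classifier head; binary output:
    'same class as the reference picture' vs. background."""
    def __init__(self, dim=256):
        super().__init__()
        self.cls = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, f_t, f_q, proposals):
        # f_t: (1, 256, H't, W't) target feature map; f_q: (1, 256, H'q, W'q) reference map
        # proposals: (n, 4) RPN boxes in (x1, y1, x2, y2) image coordinates
        box_feats = roi_align(f_t, [proposals], output_size=(7, 7),
                              spatial_scale=1.0 / 16)        # (n, 256, 7, 7)
        gap_box = box_feats.mean(dim=(2, 3))                 # GAP over each candidate region, (n, 256)
        gap_ref = f_q.mean(dim=(2, 3)).expand_as(gap_box)    # GAP over the reference map, broadcast
        logits = self.cls(torch.cat([gap_ref, gap_box], dim=1))
        return logits.softmax(dim=1)                         # P(bbox_i) per candidate box
```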
Step 1-5: the cross attention module is the core module of this model. It is constructed based on the Transformer, namely the Cross-Attention Transformer (CAT) module in fig. 1, and one layer of the cross attention module consists of two parallel Transformers. The attention mechanism of the Transformer is very well suited to learning relations among sequence elements and has very good long-distance modeling capability; its core operation can be expressed as

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

where Q, K, V denote the input sequences, whose elements all have the same dimension d_k. One major difference between the Transformer and the general attention mechanism is that it uses multi-head attention: the inputs are mapped into different subspaces, the attention operations are performed separately in each subspace, and finally all results are concatenated and projected to give the output, i.e.
MultiHead(Q, K, V) = Concat(head_1, head_2, …, head_M) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q, W_i^K, W_i^V are the matrices used to compute the values of Q, K, V in different subspaces, W^O is the projection matrix, and M is the number of attention heads. In this model, M = 8 and d_m = 256. After the multi-head attention operation, the output is sent into an FFN module formed by two fully connected layers to obtain the final result

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

with a ReLU activation function between the two layers to provide nonlinearity.
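The generic operations above can be written compactly as follows; this is a didactic re-implementation of the standard scaled-dot-product multi-head attention and FFN equations, not code from the patent, and the FFN hidden width of 1024 is an assumption.

```python
import math
import torch
import torch.nn as nn

def attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return scores.softmax(dim=-1) @ v

class MultiHead(nn.Module):
    def __init__(self, d_model=256, num_heads=8):  # M = 8, d_m = 256 as in the model
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d = num_heads, d_model // num_heads
        self.wq = nn.Linear(d_model, d_model)  # stacks W_i^Q for all heads
        self.wk = nn.Linear(d_model, d_model)  # stacks W_i^K
        self.wv = nn.Linear(d_model, d_model)  # stacks W_i^V
        self.wo = nn.Linear(d_model, d_model)  # output projection W^O

    def forward(self, q, k, v):
        B = q.size(0)
        split = lambda x: x.view(B, -1, self.h, self.d).transpose(1, 2)  # (B, M, N, d)
        out = attention(split(self.wq(q)), split(self.wk(k)), split(self.wv(v)))
        out = out.transpose(1, 2).reshape(B, -1, self.h * self.d)        # Concat(head_1..head_M)
        return self.wo(out)

class FFN(nn.Module):
    def __init__(self, d_model=256, d_hidden=1024):
        super().__init__()
        # FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, with ReLU between the two layers
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_model))

    def forward(self, x):
        return self.net(x)
```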
The above are the basic operations of a generic cross attention module. In one layer of the cross attention module of this model, let X_t ∈ R^(N_t×256) and X_q ∈ R^(N_q×256) denote the initial feature map sequence of the picture to be detected and the initial feature map sequence of the reference picture respectively, where N_t and N_q are the lengths of the two sequences; let Q = X_t, K = V = X_q, and perform the multi-head attention operation, i.e.

Y_t = Norm(X′_t + FFN(X′_t))
X′_t = Norm(X_t + P_t + MultiHead(X_t + P_t, X_q + P_q, X_q))

where P_t ∈ R^(N_t×256) and P_q ∈ R^(N_q×256) are the spatial position encodings of the sequences X_t and X_q, computed with sine functions, and Norm denotes the layer normalization operation, which makes the training gradients more stable; Y_t denotes the output corresponding to the features of the picture to be detected. For the lower cross attention module, let Q = X_q, K = V = X_t and perform the same operations to obtain Y_q, which is the output corresponding to the initial feature map sequence of the reference picture. All the operations above constitute one layer of the cross attention module; in practice, N identical layers are stacked to further enhance both sets of features. The Y_t and Y_q output after the N layers are the feature sequences fused and enhanced by the cross attention mechanism; they are reshaped back into feature maps and then used for the subsequent target detection regression and classification tasks.
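Putting the pieces together, one CAT layer as described above might look like the sketch below, reusing MultiHead and FFN from the previous snippet. The 1-D sinusoidal position encoding is a simplifying assumption (the patent computes a sine-based spatial encoding over the feature map), and adding the encoding to both queries and keys follows the equations as reconstructed above.

```python
import math
import torch
import torch.nn as nn

def sine_position_encoding(n, d_model=256):
    # standard sinusoidal encoding over sequence positions (simplified to 1-D)
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(n, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class CATLayer(nn.Module):
    """One cross-attention layer: two parallel branches, each attending
    from one sequence to the other."""
    def __init__(self, d_model=256, num_heads=8):
        super().__init__()
        self.attn_t = MultiHead(d_model, num_heads)  # upper branch: Q = X_t, K = V = X_q
        self.attn_q = MultiHead(d_model, num_heads)  # lower branch: Q = X_q, K = V = X_t
        self.ffn_t, self.ffn_q = FFN(d_model), FFN(d_model)
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, x_t, x_q):
        p_t = sine_position_encoding(x_t.size(1)).to(x_t.device)
        p_q = sine_position_encoding(x_q.size(1)).to(x_q.device)
        # X'_t = Norm(X_t + P_t + MultiHead(X_t + P_t, X_q + P_q, X_q)); Y_t = Norm(X'_t + FFN(X'_t))
        h_t = self.norms[0](x_t + p_t + self.attn_t(x_t + p_t, x_q + p_q, x_q))
        y_t = self.norms[1](h_t + self.ffn_t(h_t))
        # symmetric lower branch for the reference sequence
        h_q = self.norms[2](x_q + p_q + self.attn_q(x_q + p_q, x_t + p_t, x_t))
        y_q = self.norms[3](h_q + self.ffn_q(h_q))
        return y_t, y_q
```

Stacking is then a matter of iterating N such layers, e.g. `layers = nn.ModuleList([CATLayer() for _ in range(N)])`, and reshaping Y_t and Y_q back into feature maps afterwards.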
Step 2: training a small sample target detection model;
Step 2-1: collecting pictures and preprocessing them, adjusting the picture sizes to fixed values and then randomly flipping them to generate the training set of pictures to be detected;
scanning the training set of pictures to be detected, finding the targets in each picture to be detected, and selecting and labeling them to form target boxes; dividing the labels of all target boxes by category to generate several label lists; for a given training picture, counting all its labels, randomly selecting one of the label categories, selecting a target box of that category from the corresponding label list as the reference picture, and resizing the reference picture to a specified size;
Step 2-2: optimizing the small sample target detection model by using the SGD algorithm, wherein the loss function used for training is consistent with Faster R-CNN; when training is completed, the final small sample target detection model is obtained.
Step 3: verifying the final small sample target detection model;
Giving a pair consisting of a picture to be detected and a reference picture, and marking all targets in the picture to be detected whose class is consistent with that of the reference picture, wherein the class of the reference picture has no related labeled data in the training stage, and only one labeled example per class is available in the testing stage.
The feature sequences input to the cross attention module correspond to the features of divided regions of the original pictures, and the cross attention operation strengthens the regions whose features are similar between the picture to be detected and the reference picture, so the model can adaptively perceive the regions of the picture to be detected that may belong to the same category as the reference picture; the visualization results confirm this view. As shown in fig. 2, the first column shows the reference picture (Query image), the second column the picture to be detected (Target image), the third column the visualization of the feature map output after the picture to be detected passes through the backbone network, and the fourth column the visualization of the feature map after the cross attention module. It is easy to see that after the cross attention module, the response of the regions of the same category as the reference picture is clearly higher than that of other regions on the feature map of the picture to be detected.
Specific examples:
The invention provides a small sample target detection method based on a cross attention mechanism. The overall flow is divided into two parts, a network training stage and a network testing stage, and the whole method is an improvement on the two-stage target detection framework Faster R-CNN.
1. And (3) a network training stage:
The training stage first preprocesses the pictures. Given a picture for training, its size is adjusted by bilinear interpolation: the shorter side is resized to 600 while keeping the aspect ratio, with the longer side not exceeding 1000, and during training the picture is randomly flipped with probability 0.5. Next, the reference picture is generated: the whole data set is scanned, and the labels of all target boxes in it are divided by category to generate several label lists. For a specific training picture, all of its ground-truth labels are counted and one category is randomly selected; a picture is then selected from the previously generated label list of that category and cropped according to the position and shape of the label box to serve as the reference picture, and the input reference pictures are uniformly resized to 128×128. In the formal training, the model is optimized using the SGD algorithm, the initial learning rate is set to 0.01, the batch size is set to 16, and the learning rate is decayed to one tenth of its value at the 5th and 9th epochs. The loss function used for training is consistent with Faster R-CNN, and the model obtained after training is used for testing.
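The preprocessing and optimization schedule described above could be set up as follows; this is a sketch under the stated hyperparameters, with the dataset, model, and loss computation as placeholders (the Faster R-CNN losses are assumed to come from whatever detection framework is used).

```python
import random
import torch
import torch.nn.functional as F

def resize_keep_ratio(img, short=600, long_max=1000):
    # bilinear resize: shorter side -> 600, keeping aspect ratio, longer side <= 1000
    h, w = img.shape[-2:]
    scale = min(short / min(h, w), long_max / max(h, w))
    size = (round(h * scale), round(w * scale))
    return F.interpolate(img[None], size=size, mode="bilinear", align_corners=False)[0]

def preprocess(target_img, query_crop, training=True):
    target_img = resize_keep_ratio(target_img)
    if training and random.random() < 0.5:              # random horizontal flip with p = 0.5
        target_img = torch.flip(target_img, dims=[-1])  # (ground-truth boxes must be flipped too)
    query_crop = F.interpolate(query_crop[None], size=(128, 128),
                               mode="bilinear", align_corners=False)[0]  # reference crop -> 128x128
    return target_img, query_crop

# SGD with lr 0.01, batch size 16, and x0.1 decay at the 5th and 9th epochs:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5, 9], gamma=0.1)
```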
2. Network testing stage:
The testing stage is used to finally verify the validity of the designed cross attention structure and the performance of the overall model. The evaluation setting of this part is as follows: given an arbitrary picture pair (a picture to be detected and a reference picture), the model must mark all targets in the picture to be detected that are consistent with the category of the reference picture, where the category of the reference picture has no related labeled data in the training stage and only one labeled example of that category is available in the testing stage. The resizing of input pictures in the testing stage is consistent with the training stage.
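For the evaluation protocol above, inference over one picture pair reduces to a few lines; `model` and `preprocess` refer to the sketches earlier in this description, and the 0.5 score threshold and the returned (boxes, probs) interface are assumptions for illustration.

```python
import torch

@torch.no_grad()
def detect_one_shot(model, target_img, query_crop, score_thresh=0.5):
    model.eval()
    target_img, query_crop = preprocess(target_img, query_crop, training=False)  # same resizing as training
    boxes, probs = model(target_img[None], query_crop[None])  # (n, 4) boxes, (n, 2) class probabilities
    keep = probs[:, 1] > score_thresh                         # column 1: same class as the reference
    return boxes[keep], probs[keep, 1]
```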

Claims (3)

1. A method for detecting a small sample target based on a cross-attention mechanism, comprising the steps of:
Step 1: constructing a small sample target detection model;
Step 1-1: respectively inputting a picture to be detected and a reference picture into a shared backbone network to extract features, and acquiring initial feature maps of the picture to be detected and the reference picture, of sizes H′t×W′t and H′q×W′q respectively;
the reference picture is an image block containing only a labeled target;
Step 1-2: the number of channels of the initial feature map of the picture to be detected is adjusted to 256 by adopting a 3×3 convolution, and the map is unfolded to form the initial feature map sequence of the picture to be detected, where each element in the sequence is a 256-dimensional vector corresponding to all channel-dimension information of one point on the initial feature map of the picture to be detected;
The number of channels of the initial feature map of the reference picture is adjusted to 256 by adopting a 1×1 convolution, and the map is unfolded to form the initial feature map sequence of the reference picture, where each element in the sequence is a 256-dimensional vector corresponding to all channel-dimension information of one point on the initial feature map of the reference picture;
Step 1-3: inputting an initial feature map sequence of the picture to be detected and an initial feature map sequence of the reference picture at the same time into an upper cross attention module for feature fusion and enhancement, wherein the output sequence of the upper cross attention module is the feature map of the picture to be detected and has the same shape as the initial feature map of the picture to be detected;
inputting an initial feature map sequence of a picture to be detected and an initial feature map sequence of a reference picture into a lower cross attention module at the same time for feature fusion and enhancement, wherein the output sequence of the lower cross attention module is a feature map of the reference picture, and the shape of the output sequence is the same as that of the initial feature map of the reference picture;
The upper cross attention module and the lower cross attention module have the same structure;
The upper and lower cross attention modules are constructed as follows:
The upper cross attention module is constructed based on a Transformer and is specifically described as follows:
Let X_t ∈ R^(N_t×256) and X_q ∈ R^(N_q×256) denote the initial feature map sequence of the picture to be detected and the initial feature map sequence of the reference picture respectively, where N_t and N_q are the lengths of the two sequences; let Q = X_t, K = V = X_q, perform the multi-head attention operation, and output Y_t:

Y_t = Norm(X′_t + FFN(X′_t))
X′_t = Norm(X_t + P_t + MultiHead(X_t + P_t, X_q + P_q, X_q))
MultiHead(Q, K, V) = Concat(head_1, head_2, …, head_M) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

wherein P_t ∈ R^(N_t×256) and P_q ∈ R^(N_q×256) are the spatial position encodings of the sequences X_t and X_q, computed with sine functions; Norm denotes the layer normalization operation, which makes the training gradients more stable; Y_t denotes the output corresponding to the initial feature map sequence of the picture to be detected; Q, K, V denote input sequences whose elements all have the same dimension d_k; W_i^Q, W_i^K, W_i^V are the matrices used to compute the values of Q, K, V in different subspaces; W^O is the projection matrix; and M is the number of attention heads;
For the lower cross attention module, let Q = X_q, K = V = X_t and perform the same operations as in the upper cross attention module to obtain the output Y_q, which corresponds to the initial feature map sequence of the reference picture;
Step 1-4: inputting the feature map of the picture to be detected into an RPN network to generate target candidate regions p_1, p_2, …, p_n, extracting the features of all target candidate regions by using the ROI Align algorithm, and sending them to the detection head for fine adjustment of the candidate region boxes, represented as follows:

bbox_i = R_head(A_ROI(F_t, p_i)), i = 1, …, n

wherein bbox_i denotes the i-th target candidate region feature, R_head denotes the detection-head regressor network, A_ROI denotes the ROI Align operation, F_t is the feature map of the picture to be detected, and i = 1, …, n;
and simultaneously carrying out a global pooling operation on the reference picture feature map and the target candidate region features, merging the two feature vectors obtained by the global pooling operation, and then sending the result into a classifier to obtain the classification result:

P(bbox_i) = C_cls(Concat(GAP(F_q), GAP(bbox_i)))

wherein GAP(·) denotes the global pooling operation, Concat(·) denotes the merging operation, C_cls denotes the classifier, F_q denotes the reference picture feature map, and P(bbox_i) denotes the predicted class probability distribution of each target box;
Step 2: training the small sample target detection model;
Step 2-1: collecting pictures and preprocessing them, adjusting the picture sizes to fixed values and then randomly flipping them to generate the training set of pictures to be detected;
scanning the training set of pictures to be detected, finding the targets in each picture to be detected, and selecting and labeling them to form target boxes; dividing the labels of all target boxes by category to generate several label lists; for a given training picture, counting all its labels, randomly selecting one of the label categories, selecting a target box of that category from the corresponding label list as the reference picture, and resizing the reference picture to a specified size;
Step 2-2: optimizing the small sample target detection model by using the SGD algorithm, wherein the loss function used for training is consistent with Faster R-CNN, and completing training to obtain the final small sample target detection model;
Step 3: verifying the final small sample target detection model;
Giving a pair consisting of a picture to be detected and a reference picture, and marking all targets in the picture to be detected whose class is consistent with that of the reference picture, wherein the class of the reference picture has no related labeled data in the training stage, and only one labeled example per class is available in the testing stage.
2. The method of claim 1, wherein the backbone network is a ResNet-50 model.
3. The method for small sample object detection based on cross-attention mechanism of claim 1, wherein M = 8 and d_m = 256.
CN202110482786.5A 2021-04-30 2021-04-30 Small sample target detection method based on cross attention mechanism Active CN113221987B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110482786.5A CN113221987B (en) 2021-04-30 2021-04-30 Small sample target detection method based on cross attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110482786.5A CN113221987B (en) 2021-04-30 2021-04-30 Small sample target detection method based on cross attention mechanism

Publications (2)

Publication Number Publication Date
CN113221987A (en) 2021-08-06
CN113221987B (en) 2024-06-28

Family

ID=77090554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110482786.5A Active CN113221987B (en) 2021-04-30 2021-04-30 Small sample target detection method based on cross attention mechanism

Country Status (1)

Country Link
CN (1) CN113221987B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822368B (en) * 2021-09-29 2023-06-20 成都信息工程大学 Anchor-free incremental target detection method
CN114663431B (en) * 2022-05-19 2022-08-30 浙江大学 Pancreatic tumor image segmentation method and system based on reinforcement learning and attention
CN116091787B (en) * 2022-10-08 2024-06-18 中南大学 Small sample target detection method based on feature filtering and feature alignment
CN116205916B (en) * 2023-04-28 2023-09-15 南方电网数字电网研究院有限公司 Method and device for detecting defects of small electric power sample, computer equipment and storage medium
CN116597384B (en) * 2023-06-02 2024-03-05 中国人民解放军国防科技大学 Space target identification method and device based on small sample training and computer equipment
CN117953206A (en) * 2024-03-25 2024-04-30 厦门大学 Mixed supervision target detection method and device based on point labeling guidance

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107515895B (en) * 2017-07-14 2020-06-05 中国科学院计算技术研究所 Visual target retrieval method and system based on target detection
CN110211097B (en) * 2019-05-14 2021-06-08 河海大学 Crack image detection method based on fast R-CNN parameter migration
CN111259940B (en) * 2020-01-10 2023-04-07 杭州电子科技大学 Target detection method based on space attention map
US10970598B1 (en) * 2020-05-13 2021-04-06 StradVision, Inc. Learning method and learning device for training an object detection network by using attention maps and testing method and testing device using the same
CN112150504A (en) * 2020-08-03 2020-12-29 上海大学 Visual tracking method based on attention mechanism
CN112488999B (en) * 2020-11-19 2024-04-05 特斯联科技集团有限公司 Small target detection method, small target detection system, storage medium and terminal

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Application of a deep neural network fusing dual attention to UAV target detection (融合双注意力的深度神经网络在无人机目标检测中的应用); Zhan Zheqi; Chen Peng; Sang Yongsheng; Peng Dezhong; Modern Computer (现代计算机); 2020-04-15 (No. 11); full text *
A survey of visual single-target tracking algorithms (视觉单目标跟踪算法综述); Tang Yiming; Liu Yufei; Huang Hong; Measurement & Control Technology (测控技术); 2020-08-18 (No. 08); full text *

Also Published As

Publication number Publication date
CN113221987A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
CN113221987B (en) Small sample target detection method based on cross attention mechanism
CN111612807A (en) Small target image segmentation method based on scale and edge information
CN107784288A (en) A kind of iteration positioning formula method for detecting human face based on deep neural network
CN111652273B (en) Deep learning-based RGB-D image classification method
Huang et al. Qualitynet: Segmentation quality evaluation with deep convolutional networks
CN113112416B (en) Semantic-guided face image restoration method
CN113313149B (en) Dish identification method based on attention mechanism and metric learning
CN112991364A (en) Road scene semantic segmentation method based on convolution neural network cross-modal fusion
CN114626476A (en) Bird fine-grained image recognition method and device based on Transformer and component feature fusion
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN111222530A (en) Fine-grained image classification method, system, device and storage medium
Avola et al. Real-time deep learning method for automated detection and localization of structural defects in manufactured products
CN110851627B (en) Method for describing sun black subgroup in full-sun image
CN115410059A (en) Remote sensing image part supervision change detection method and device based on contrast loss
Shit et al. An encoder‐decoder based CNN architecture using end to end dehaze and detection network for proper image visualization and detection
CN113486715A (en) Image reproduction identification method, intelligent terminal and computer storage medium
CN116503398B (en) Insulator pollution flashover detection method and device, electronic equipment and storage medium
CN116486228A (en) Paper medicine box steel seal character recognition method based on improved YOLOV5 model
CN115861284A (en) Printed product linear defect detection method, printed product linear defect detection device, electronic equipment and storage medium
Zheng et al. Visual chirality meets freehand sketches
CN115205877A (en) Irregular typesetting invoice document layout prediction method and device and storage medium
Dong et al. Intelligent pixel-level pavement marking detection using 2D laser pavement images
CN114897901B (en) Battery quality detection method and device based on sample expansion and electronic equipment
CN111340111B (en) Method for recognizing face image set based on wavelet kernel extreme learning machine
CN117593514B (en) Image target detection method and system based on deep principal component analysis assistance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant