CN113822368B - Anchor-free incremental target detection method - Google Patents

Anchor-free incremental target detection method

Info

Publication number
CN113822368B
CN113822368B (application number CN202111153974.XA)
Authority
CN
China
Prior art keywords
class
training
target
specific
meta
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111153974.XA
Other languages
Chinese (zh)
Other versions
CN113822368A (en)
Inventor
符颖
林弟忠
胡金蓉
文武
邹书蓉
周激流
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology filed Critical Chengdu University of Information Technology
Priority to CN202111153974.XA priority Critical patent/CN113822368B/en
Publication of CN113822368A publication Critical patent/CN113822368A/en
Application granted granted Critical
Publication of CN113822368B publication Critical patent/CN113822368B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of image recognition, and particularly discloses an anchor-free incremental target detection method, which comprises the following steps: step 1, selecting a target detection model; step 2, constructing a small sample target detection model based on the target detection model of step 1; step 3, performing meta-training on the small sample target detection model; and step 4, performing meta-testing on the trained small sample target detection model. According to the invention, under training with a large amount of base class data (images) containing rich labels and a small number of labeled few-shot samples, the detection effect on new-class test pictures is improved, i.e., the mAP and AR scores are increased.

Description

Anchor-free incremental target detection method
Technical Field
The invention relates to the field of image recognition, in particular to an incremental target detection method based on no anchor.
Background
Against the background of the gradual maturation of high-performance parallel computing technology and the rapid development of neural networks, target detection based on deep learning has rapidly replaced feature extraction methods based on manually designed features. The main task of target detection is to locate objects of interest in an input image and then accurately determine the class of each object of interest. Mature target detection algorithms have been successfully deployed in practical application scenarios such as video surveillance, autonomous driving and traffic scene detection by relying on large-scale annotated data, but a large amount of labeled data is required. Affected by an insufficient amount of annotated data, the practical application scenarios are not wide enough and the detection tasks that can be performed are limited. Labeling data is a very expensive and laborious task, and it is impractical to obtain large-scale annotated data in most practical application scenarios, which greatly limits the application of existing target detection algorithms to more practical scenes.
Based on this, how to learn a target detection model with a certain generalization capability from very little labeled data is a problem that urgently needs further study. In recent years, a large number of researchers have focused on target detection under scenarios with only a small amount of annotated data, which is called small sample (few-shot) target detection. Most current research models are based on traditional target detection frameworks such as Faster-RCNN, YOLO and SSD and draw on the meta-training strategy of few-shot learning: after training on a large number of base classes, a small number of labeled samples of new classes are injected into the detection model, and classification and regression of the new classes are completed, which is very challenging.
In 2019, Fan et al. introduced an attention-RPN module into the candidate-box region extraction network to fuse the features of the query image and the support set images, proposed a multi-relation detector for learning feature relations at the local, global and cross-correlation levels, and adopted a two-way contrastive training strategy to perform similarity matching for detecting new classes. In 2020, Wang et al. [6] proposed a two-stage training scheme with Faster-RCNN as the framework, fine-tuning only the classification and regression sub-networks in the second stage and adapting to novel classes by readjusting the combination weights of the features. In 2020, Juan-Manuel et al. [7] proposed to build on the CenterNet framework, introducing a feature extractor for image feature extraction, a target locator for target localization, and a ResNet-50 network to extract the class-specific weights from the images of each category, using these weights to complete the detection of the new categories. In 2021, Bo Sun et al. [9] built on the work of Wang et al., introducing the ideas of a feature pyramid model and contrastive learning, improving detection performance by 2.7% on the COCO benchmark dataset and by 8.8% on the standard PASCAL VOC dataset. This kind of detection method performs detection directly in the conventional inference manner, can easily introduce new classes, is very efficient, has low data requirements for the new classes, and has obvious performance advantages over existing methods. However, the ONCE network proposed by Juan-Manuel et al. on the CenterNet target detection framework uses only a ResNet-50 network to extract the class-specific codes, which, judging from the experimental results, is not an optimal scheme: its feature extraction capability for new classes is insufficient.
Disclosure of Invention
In order to solve the above problems, the invention provides an anchor-free incremental target detection method, which can better extract class-specific feature information and optimize the detection results.
The invention is realized by the following technical scheme:
An anchor-free incremental target detection method comprises the following steps:
step 1, selecting a target detection model;
step 2, constructing a small sample target detection model based on the target detection model in the step 1;
step 3, performing meta-training on the small sample target detection model;
and 4, performing meta-test on the trained small sample target detection model.
As optimization, in step 1, the specific steps of constructing the target detection model include:
step 1.1, selecting a CenterNet detection network as a target detection model of a base class network.
In the step 2, as optimization, the specific steps of constructing a small sample target detection model are as follows:
step 2.1, a CenterNet detection network is regarded as being composed of a feature extractor and a target locator, wherein the feature extractor adopts a ResNet residual network as an encoder, a deconvolution network as a decoder, and all new classes and base classes share weights; the target locator contains convolution kernel weights of each individual class to be detected, analyzes the 3D feature map output by the feature extractor by using class-specific convolution kernels, and generates detection results of input samples in the form of heat-maps;
step 2.2, introducing a class-specific code generator having a class encoder with the same structure as the encoder of the feature extractor, for generating convolution kernel weights C_k and using the generated convolution kernel weights C_k to parameterize the target locator; and performing contrastive learning branch training on the class codes generated by the class-specific code generator, so as to improve the consistency among class codes of the same class and enlarge the diversity between class codes of different classes.
As an optimization, in step 3, the specific steps of performing meta-training on the small sample target detection model are as follows:
step 3.1, training a class feature extractor on a CenterNet detection network by utilizing a base class data set, wherein the class feature extractor is used for extracting features of new class data;
step 3.2, dividing the labeled base class data set into support set and query set images, inputting the query set into the feature extractor and the support set into the class-specific code generator, extracting features from the base class data set by the feature extractor, and generating class codes related to the base class data set by the class-specific code generator;
and 3.3, performing joint training on the class specific code generator and the target localizer, so that the target localizer performs positioning learning of new class data by combining the class codes and the extracted features.
In step 4, as optimization, the specific steps of performing meta-test on the trained small sample target detection model are as follows:
step 4.1, inputting a small amount of new class data with labels to the trained class-specific encoder to generate weight parameters of specific classes, and parameterizing the target locator;
step 4.2, the class feature extractor performs feature extraction on the new class data and outputs a feature map,
and 4.3, finishing the detection of the target in the test image by the parameterized target positioner.
As optimization, in step 2.2, the specific steps of performing contrast learning branch training on the class code generated by the class specific code generator are as follows:
step 2.2.1, converting the class code features into 128-dimensional contrast features by using a one-layer multi-layer perceptron;
step 2.2.2, measuring the similarity between codes of different classes on the normalized class features output by the multi-layer perceptron, and optimizing with the loss function of the supervised contrastive learning of class-specific codes to improve the intra-class similarity and inter-class variability of the class codes.
As an optimization, in step 2.2.2, it is assumed that there are two class encodings X_i and X_j with the same label, and the loss function of the supervised contrastive learning of class-specific codes is:

L_CCE = (1/N) · Σ_{i=1}^{N} L_{z_i}   (1)

L_{z_i} = −1/(N_{y_i} − 1) · Σ_{j=1, j≠i, y_j=y_i}^{N} log [ exp(z_i · z_j / τ) / Σ_{k=1}^{N} 1_{k≠i} · exp(z_i · z_k / τ) ]   (2)

In formula (1), L_{z_i} is the loss function value calculated for a single sample and L_CCE is the average loss function value of a meta-task. In formula (2), N_{y_i} represents the number of samples belonging to the same class in a meta-task, 1_{k≠i} is the indicator function, which takes 0 if and only if k = i and 1 otherwise, τ is the temperature parameter to be tuned, and z_i is the 128-dimensional normalized class feature obtained by processing the class encoding X_i with the multi-layer perceptron; the numerator in the log function is the representation distance between X_i and X_j, and the denominator is the representation distance between X_i and all data (positive and negative samples) in each meta-task.
As an optimization, in step 2.2, the class-specific code generator outputs a class-related convolution kernel weight C_k by means of global average pooling to further parameterize the weight parameters of the target locator.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. according to the invention, under training with a large amount of base class data (images) containing rich labels and a small number of labeled few-shot samples, the detection effect on new-class test pictures is improved, i.e., the mAP and AR scores are increased.
2. The method of the invention has fewer misjudgments and can more accurately detect difficult targets in the image.
3. The invention can be easily migrated to other data sets for detection, which has important significance for the related work of small sample target detection.
Drawings
In order to more clearly illustrate the technical solutions of the exemplary embodiments of the present invention, the drawings that are needed in the examples will be briefly described below, it being understood that the following drawings only illustrate some examples of the present invention and therefore should not be considered as limiting the scope, and that other related drawings may be obtained from these drawings without inventive effort for a person skilled in the art. In the drawings:
fig. 1 is a schematic diagram of the network structure of CenterNet in the anchor-free incremental target detection method according to the present invention;
fig. 2 is a schematic diagram of a network structure of a small sample target detection model in an incremental target detection method based on no anchor according to the present invention.
Fig. 3 is a schematic diagram of class code contrast learning branches in an incremental target detection method based on no anchor according to the present invention.
Fig. 4 is a schematic drawing of Precision and Recall curves.
FIGS. 5 and 6 are graphs comparing the effects of ONCE (upper) and the anchor-free incremental target detection method according to the invention.
Detailed Description
For the purpose of making apparent the objects, technical solutions and advantages of the present invention, the present invention will be further described in detail with reference to the following examples and the accompanying drawings, wherein the exemplary embodiments of the present invention and the descriptions thereof are for illustrating the present invention only and are not to be construed as limiting the present invention.
Research shows that, when a small amount of new class data (pictures) is injected into a small sample target detection model, training without mixing in base class data (pictures) causes great interference to the model: the new class data are easily misjudged as the already-trained base classes, which seriously affects the detection results. However, injecting never-seen new data samples into a model trained on a large number of base classes for detection is one of the important tasks in the field of small sample target detection, because new data samples with only a small amount of labeled data can then be easily injected for detection.
In this study the dataset is divided into base classes and new classes with no class intersection at all, where the base class samples with rich label information serve as feature guidance and a small number of labeled new classes are used to test model performance. After being trained on the base classes, the constructed model can be effectively transferred to unseen new class samples for detection, without constructing a balanced small sample set of base and new classes for retraining. Such a detection approach is very challenging, but it allows new sample classes to be injected easily for detection. For the construction of new-class small sample sets, see Xin Wang, Thomas E. Huang, Trevor Darrell, Joseph E. Gonzalez, and Fisher Yu. Frustratingly Simple Few-Shot Object Detection. In International Conference on Machine Learning (ICML), July 2020.
according to the incremental target detection method based on the no-anchor, target detection is divided into two types of anchor-free frames and anchor-free frames, wherein the anchor frames are predefined by a network, and complex calculation is carried out. The anchor-free frame is a detection target frame directly generated in the later stage of the network, and the incremental type is mentioned in the application, so that new data can be easily introduced, and the incremental type is directly added. Namely, a small sample target detection method based on contrast learning class specific coding comprises the following steps: and 1, selecting a target detection model. In order to construct an effective detection network that can be adapted to detect new class samples, it is first necessary to select an appropriate base class detection network, the selection not being arbitrary, since efficient and rapid detection of a small number of new class samples needs to be accomplished. In recent years, fast-RCNN is often used as a base class detection network, but it adopts a two-stage design and classifies based on softmax, and interaction between classes makes detecting independent new classes inflexible. Therefore, the target detection model selects a central Net detection network of an Anchor-free target detection algorithm capable of carrying out rapid and efficient detection as a base class network, and compared with networks such as YOLO, SSD and the like, the target detection model can achieve a trade-off in detection speed and precision and is easier to construct a specific class of characterization extraction module.
The network design of CenterNet is derived from the idea of keypoint detection: keypoint estimation is used to find a center point, target attributes such as the height and width of the target bounding box are obtained by regression, and no post-processing such as keypoint grouping or non-maximum suppression is needed; the network structure is shown in fig. 1. CenterNet feeds the training images into a fully convolutional encoding-decoding network to obtain a multi-channel heatmap, where each category corresponds to one heatmap. The peak points of the 2D heatmap are the center points, and the height and width of the targets are also predicted at the peak point positions. The specific detection details are described later. This keypoint-based target detection framework not only eliminates region candidate boxes, but can also generate a separate prediction heatmap for each class and perform independent detection through an activation threshold. Such a detection framework is very suitable for injecting new classes in small sample target detection, since the base classes and the new classes do not interfere with each other.
And 2, constructing a small sample target detection model based on the target detection model in the step 1.
In order to make the detection model output corresponding weight parameters for each class of images and extract effective features of the new classes independently, the target detection model, i.e. the CenterNet network structure, is adjusted. Instead of training and obtaining the weight parameters in the original end-to-end manner, a meta-learning training strategy is introduced and a new small sample target detection model is designed. The network structure is shown in fig. 2.
We regard the CenterNet network as consisting of a feature extractor and a target locator, where the feature extractor uses a ResNet residual network as the encoder and a deconvolution network as the decoder, and all new classes and base classes share its weights. The target locator contains the specific weight parameters of each individual class to be detected, uses the class-specific convolution kernels to analyze the 3D feature map output by the feature extractor, and generates the detection results of the input samples in the form of a heatmap.
Then, a class-specific code generator is introduced to generate the convolution kernel weights in the target locator, replacing the original way of updating the weights through iterative training. The class-specific code generator adopts the same encoder structure as the feature extraction network but is not connected to the deconvolution network; it outputs the class-related weight parameters (convolution kernel weights) C_k by global average pooling, which further parameterize the target locator, i.e. the weight parameters C_k produced by the class-specific code generator network compose the parameters of the target locator network (such as the class-related convolution kernel weight parameters). In addition, a contrastive learning branch is connected behind the class-specific encoder to guide the class-specific code generator to learn contrast-aware class-specific codes, compute the feature similarity between class-specific codes, and better model intra-class similarity and inter-class difference features.
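By way of illustration only, the overall structure described above might be sketched as follows; the module layout, channel sizes and the use of a single 256-dimensional code per class are simplifying assumptions and not the patent's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class FeatureExtractor(nn.Module):
    """ResNet encoder + deconvolution decoder, shared by all base and new classes."""
    def __init__(self, out_channels=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])   # (B, 2048, h/32, w/32)
        self.decoder = nn.Sequential(                                  # deconvolve back to h/R, w/R (R = 4)
            nn.ConvTranspose2d(2048, 512, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, out_channels, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.decoder(self.encoder(x))

class ClassSpecificCodeGenerator(nn.Module):
    """Same encoder structure as the feature extractor, without the deconvolution network;
    the globally average-pooled output acts as a class-specific convolution kernel."""
    def __init__(self, code_dim=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])
        self.project = nn.Conv2d(2048, code_dim, kernel_size=1)
    def forward(self, support_images):
        feat = self.project(self.encoder(support_images))
        return F.adaptive_avg_pool2d(feat, 1).flatten(1)               # (N, code_dim) class codes C_k

def locate(feature_map, class_code):
    """Target locator: the generated class code parameterizes a class-specific 1x1 convolution."""
    kernel = class_code.view(1, -1, 1, 1)                              # 1x1 conv kernel from C_k
    return torch.sigmoid(F.conv2d(feature_map, kernel))                # per-class heatmap
```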
Specifically, the steps of the contrastive learning branch training applied to the class codes generated by the class-specific code generator are as follows:
step 2.2.1, converting the class code features into 128-dimensional contrast features by using a one-layer multi-layer perceptron head (1-layer multi-layer-perceptron (MLP) head);
step 2.2.2, measuring the similarity between codes of different classes on the normalized class features output by the multi-layer perceptron, and optimizing with the loss function of the supervised contrastive learning of class-specific codes to improve the intra-class similarity and inter-class variability of the class codes.
The loss function of the supervised contrastive learning of class-specific codes is formulated drawing on related work in self-supervised and supervised contrastive learning.
In the supervised scenario, the features of class codes of the same class form positive sample pairs, and the features of class codes of different classes form negative sample pairs. To pull together class codes X_i and X_j with identical labels and push apart class codes with different labels, the loss function of the supervised contrastive learning of class-specific codes is designed as shown in formulas (1) and (2). The similarity between class-code features is measured by means of the inner product. In training, multiple meta-tasks are sampled, and the support set portion of each meta-task samples N samples from the given data, denoted as {x_k, y_k}, k = 1, 2, ..., N, where y_k is the label of x_k.
L_CCE = (1/N) · Σ_{i=1}^{N} L_{z_i}   (1)

L_{z_i} = −1/(N_{y_i} − 1) · Σ_{j=1, j≠i, y_j=y_i}^{N} log [ exp(z_i · z_j / τ) / Σ_{k=1}^{N} 1_{k≠i} · exp(z_i · z_k / τ) ]   (2)

In formula (1), L_{z_i} is the loss calculated for a single sample and L_CCE is the average loss of one meta-task. In formula (2), N_{y_i} represents the number of samples belonging to the same class in a meta-task, 1_{k≠i} is the indicator function, which takes 0 if and only if k = i and 1 otherwise, and τ is the temperature parameter to be tuned [12]. z_i is the 128-dimensional normalized class feature obtained by processing the class encoding X_i with the multi-layer perceptron (MLP-Head). In the log function, the numerator is the representation distance between X_i and X_j, and the denominator is the representation distance between X_i and all data (positive and negative samples) in each meta-task. This design well characterizes intra-class similarity and inter-class difference, improves the coding performance of the class-specific encoder, and its effectiveness is demonstrated in the later experiments.
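For illustration, a minimal PyTorch-style sketch of this supervised contrastive loss over the MLP-projected class codes could look as follows (tensor shapes, the temperature default and variable names are assumptions):

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z, labels, tau=0.07):
    """Supervised contrastive loss over normalized class-code features.

    z:      (N, 128) MLP-head outputs for the class codes in one meta-task
    labels: (N,)     class labels y_i of the corresponding support samples
    tau:    temperature parameter
    """
    z = F.normalize(z, dim=1)                       # normalized contrast features
    sim = torch.matmul(z, z.t()) / tau              # pairwise inner-product similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    # denominator of formula (2): all samples except i itself (positives and negatives)
    exp_sim = torch.exp(sim).masked_fill(self_mask, 0.0)
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-12)
    # positives: samples with the same label, excluding i itself
    pos_mask = ((labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask).float()
    n_pos = pos_mask.sum(dim=1).clamp(min=1)        # N_{y_i} - 1
    loss_per_sample = -(log_prob * pos_mask).sum(dim=1) / n_pos   # L_{z_i}
    return loss_per_sample.mean()                                 # L_CCE, formula (1)
```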
And step 3, performing meta-training on the small sample target detection model.
Drawing on the training strategy of meta-learning, and in order to make full use of the base categories, meta-training is divided into two serial stages: in the first stage, a class feature extractor is trained on the CenterNet detection network structure with a large amount of base class data, and is used to extract the features of the small amount of new class data; the second stage of training is divided into multiple episodes (the term episode originates from Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching Networks for One Shot Learning. NeurIPS, 2016; see also T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal Loss for Dense Object Detection. ICCV, 2017). Each episode divides the labeled training data into query set and support set images, which are input into the feature extractor and the class-specific encoder of the small sample target detection model, respectively, using a strategy of jointly training the class-specific encoder and the target locator, wherein the target locator performs the localization learning of the small sample targets by combining the class codes and the extracted features.
In this training method, each episode samples a plurality of meta-tasks containing different category combinations. This mechanism enables the model to learn the parts that are common across different meta-tasks, such as how to extract important features and compare the similarity of samples, and to forget the task-specific parts of each meta-task. A model trained with this learning mechanism can classify better when facing new, unseen meta-tasks, which is more beneficial for learning class codes.
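As an illustrative sketch of this episodic sampling (dataset structure, class counts and shot numbers are assumed, not prescribed by the text):

```python
import random

def sample_meta_task(images_by_class, n_way=3, k_support=5, k_query=5):
    """Sample one meta-task: a class label set L plus a support/query split.

    images_by_class: dict mapping class label -> list of annotated images
    """
    label_set = random.sample(list(images_by_class), n_way)       # e.g. {banana, umbrella, ...}
    support, query = [], []
    for cls in label_set:
        imgs = random.sample(images_by_class[cls], k_support + k_query)
        support += [(img, cls) for img in imgs[:k_support]]        # -> class-specific encoder
        query += [(img, cls) for img in imgs[k_support:]]          # -> feature extractor
    return support, query

def sample_episode(images_by_class, n_tasks=28):
    """An episode is a fixed number of meta-tasks with different class combinations."""
    return [sample_meta_task(images_by_class) for _ in range(n_tasks)]
```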
First stage training: learning of feature extractors
On the base class dataset, random flipping, random scaling, cropping and color dithering are first used as the data enhancement T_1, and standard supervised training is carried out in a manner similar to CenterNet. Given a training image I ∈ R^{h×w×3} of height h and width w, it is input to a feature extractor f(·) based on a ResNet residual network to extract a feature map m = f(I), with m ∈ R^{(h/R)×(w/R)×c}, where R is the downsampling factor and m comprises c channels corresponding to the c categories of target objects. Then, in the target locator h, the convolution kernels c_k ∈ R^{1×1×c} of the base classes are learned and convolved with the feature map m to obtain the heatmap Ŷ of each class, where Ŷ_{xyc} = 1 represents a detected keypoint, i.e. a class-c object is detected at the (x, y) coordinates, while Ŷ_{xyc} = 0 indicates that the current coordinate point contains no class-c object and is regarded as background. As shown in formula (3), Ŷ has c + 4 channels: the c channels correspond to the predictions of the c categories, and the remaining four channels correspond to the prediction of the bounding-box center-point offset and the prediction of the bounding-box width and height, respectively. ⊛ denotes the convolution operation and K_b is the number of base class samples.

Ŷ = m ⊛ {c_k},  k = 1, 2, ..., K_b   (3)
In training, the ground-truth keypoints p ∈ R² are converted to the downsampled coordinates p̃ = ⌊p/R⌋ used for training. On the c channel feature maps obtained by downsampling, the individual ground-truth keypoints are splatted onto the heatmap Y ∈ [0, 1]^{(h/R)×(w/R)×c} through a Gaussian kernel Y_{xyc}. The calculation of Y_{xyc} is shown in formula (4), where σ_p is an adaptive standard deviation associated with the target's width and height. If the Gaussian distributions generated by different objects of the same class overlap, the larger Gaussian value is taken within the overlap.

Y_{xyc} = exp(−((x − p̃_x)² + (y − p̃_y)²) / (2σ_p²))   (4)
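By way of illustration, the splatting of ground-truth keypoints with this Gaussian kernel can be sketched as follows (a NumPy sketch with assumed array shapes; the 3σ truncation radius is an implementation assumption, not specified in the text):

```python
import numpy as np

def draw_gaussian(heatmap, center, sigma):
    """Splat one ground-truth keypoint onto a single-class heatmap channel.

    heatmap: (H, W) array for one class, values in [0, 1]
    center:  (cx, cy) downsampled ground-truth keypoint coordinates
    sigma:   adaptive standard deviation derived from the object's size
    """
    h, w = heatmap.shape
    cx, cy = int(center[0]), int(center[1])
    radius = int(3 * sigma)
    x0, x1 = max(0, cx - radius), min(w, cx + radius + 1)
    y0, y1 = max(0, cy - radius), min(h, cy + radius + 1)
    ys, xs = np.ogrid[y0:y1, x0:x1]
    gaussian = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    # if Gaussians of same-class objects overlap, keep the larger value
    np.maximum(heatmap[y0:y1, x0:x1], gaussian, out=heatmap[y0:y1, x0:x1])
    return heatmap
```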
The keypoint heatmap is trained with a loss function rewritten from the Focal loss, as shown in equation (5):

L_k = −(1/N) Σ_{xyc} { (1 − Ŷ_{xyc})^α · log(Ŷ_{xyc}),                    if Y_{xyc} = 1
                       (1 − Y_{xyc})^β · (Ŷ_{xyc})^α · log(1 − Ŷ_{xyc}),  otherwise }   (5)

where α and β are the hyperparameters of the Focal loss (α = 2 and β = 4 were used in the experiments) and N is the number of keypoints in image I; see [13] for more details.
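A minimal sketch of the penalty-reduced focal loss in equation (5), assuming PyTorch tensors of shape (B, C, H, W) for the predicted and ground-truth heatmaps (names and the clamping constant are assumptions):

```python
import torch

def centernet_focal_loss(pred, gt, alpha=2, beta=4):
    """Pixel-wise focal loss on the predicted heatmap, as in equation (5).

    pred: (B, C, H, W) predicted heatmap, values in (0, 1)
    gt:   (B, C, H, W) ground-truth Gaussian heatmap, peaks equal 1
    """
    pos_mask = gt.eq(1).float()
    neg_mask = gt.lt(1).float()
    pred = pred.clamp(1e-6, 1 - 1e-6)          # numerical stability
    pos_loss = torch.log(pred) * (1 - pred) ** alpha * pos_mask
    neg_loss = torch.log(1 - pred) * pred ** alpha * (1 - gt) ** beta * neg_mask
    num_pos = pos_mask.sum().clamp(min=1)      # N: number of keypoints
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```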
Mapping the downsampled feature map back onto the original image introduces a precision deviation, so one offset needs to be predicted for each center point, i.e. Ô ∈ R^{(h/R)×(w/R)×2}. The center points of all classes c share the same offset prediction, which is trained with the L1 loss as shown in equation (6):

L_off = (1/N) Σ_p | Ô_{p̃} − (p/R − p̃) |   (6)

where Ô_{p̃} is the predicted offset and (p/R − p̃) is a value calculated in advance during the training process.
Assume that (x1^(k), y1^(k), x2^(k), y2^(k)) is the bounding box of object k with category c_k and center point p_k = ((x1^(k) + x2^(k))/2, (y1^(k) + y2^(k))/2). The size of each target k is regressed to s_k = (x2^(k) − x1^(k), y2^(k) − y1^(k)), where s_k is likewise calculated in advance as the ground-truth width and height after downsampling. The keypoint prediction network Ŷ is used to predict all center points, and to reduce the burden of the regression calculation a single size prediction Ŝ ∈ R^{(h/R)×(w/R)×2} is used for every point in the heatmap, also trained with the L1 regression loss, as shown in formula (7):

L_size = (1/N) Σ_{k=1}^{N} | Ŝ_{p̃_k} − s_k |   (7)
during prediction, the local peak point of each c category on the hetmap is extracted respectively
Figure BDA00032879556200000812
These peaks are activation values greater than or equal to Y k The 8 neighborhood points in the model are probability values of the c-class target center points, and the first 100 points are reserved in a 3X3 maximum pooling mode. Calling out center points greater than the threshold value from the selected 100 peak points according to the set threshold value>
Figure BDA00032879556200000813
As a final result. And use +.>
Figure BDA00032879556200000814
And (5) regressing the target frame as a confidence index of the current point. The predicted target frame is shown in formula (8):
Figure BDA00032879556200000815
wherein the method comprises the steps of
Figure BDA00032879556200000816
Is a prediction of the center point offset. />
Figure BDA00032879556200000817
Wherein->
Figure BDA00032879556200000818
Is the size prediction of the bounding box. Both offset and bounding box use L 1 The regression loss function is trained as shown in equation (6) and equation (7).
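For illustration, the prediction-time decoding described above (3×3 max pooling as peak extraction, top-100 points, thresholding and box assembly per formula (8)) could be sketched as follows; the tensor layouts, the joint top-k over all classes, and the scaling back to the input image are assumptions rather than the patent's exact procedure:

```python
import torch
import torch.nn.functional as F

def decode_detections(heatmap, offset, size, k=100, threshold=0.3):
    """Decode center-point detections from the predicted heads.

    heatmap: (C, H, W) class heatmaps after sigmoid
    offset:  (2, H, W) center-point offset predictions
    size:    (2, H, W) bounding-box width/height predictions
    """
    # keep only local peaks: a point survives if it equals its 3x3 neighborhood max
    pooled = F.max_pool2d(heatmap.unsqueeze(0), 3, stride=1, padding=1).squeeze(0)
    heatmap = heatmap * (pooled == heatmap).float()

    c, h, w = heatmap.shape
    scores, idx = heatmap.view(-1).topk(k)        # top-k peaks (taken jointly over classes here)
    classes = idx // (h * w)
    ys = (idx % (h * w)) // w
    xs = idx % w

    boxes = []
    for score, cls, y, x in zip(scores, classes, ys, xs):
        if score < threshold:                      # keep only points above the set threshold
            continue
        dx, dy = offset[0, y, x].item(), offset[1, y, x].item()
        bw, bh = size[0, y, x].item(), size[1, y, x].item()
        cx, cy = x.item() + dx, y.item() + dy
        # box in heatmap coordinates per formula (8); multiply by the downsampling
        # factor R to map back to the input image if required
        boxes.append((cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2,
                      score.item(), int(cls.item())))
    return boxes
```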
The total training loss consists of three parts (Heatmap, Offset and Size) and is used to optimize the parameters of the feature extractor f and the convolution kernel parameters c_k of the locator. As shown in formula (9), λ_size = 0.1 and λ_off = 1 were set in the experiments.

L_det = L_k + λ_size · L_size + λ_off · L_off   (9)
The goal of this stage is only to learn a feature extractor, while the target locator is a conventional CenterNet locator, which is discarded in the second stage; only the trained feature extractor f(·) is kept.
Training in the second stage: class-specific encoder learning
The class-related convolution kernel parameters obtained by training in the previous stage are only the fixed parameters of the base classes. In the second stage, the parameters of the feature extractor are frozen, and the class-specific encoder connected with the contrastive learning branch is mainly trained, so that it can efficiently encode a new class from a small number of labeled samples. To train the class-specific encoder efficiently, we use the episode meta-learning training strategy [13].
The specific method is as follows: the whole training data is divided into a number of episodes, and each episode contains a specified number of meta-tasks. Each meta-task samples a class label set L from all classes (for example, L = {banana, umbrella, ...}), and according to L a support set S and a query set Q of training samples are chosen for each meta-task, where S and Q are obtained by randomly distributing the pictures of each class in the label set. Random horizontal flipping, random cropping and color dithering are used as the data enhancement T_2 for the support set images; the image x in each support set is augmented twice to generate samples x_a and x_b, which form a basic positive sample pair and are input to the class-specific encoder to extract features. The class-specific encoder is initialized with the encoder weights of the feature extractor trained in the first stage. During forward propagation, for each meta-task the query set images are augmented with T_1, the feature extractor trained in the first stage extracts the query-set features, and a feature map of c channels is obtained according to formula (10). Meanwhile, the class-specific encoder processes the support set images augmented by T_2 to generate the corresponding weight parameters C_k, as shown in formula (11):

m_Q = f(I),  I ∈ Q   (10)

C_k = g(I_k^S),  I_k^S ∈ S   (11)

where m_Q is the feature of query set image I obtained via the feature extractor, g(·) denotes the class-specific encoder and I_k^S is a support set sample of class k. Finally, a 3 × 256-dimensional class code C_k is output by global average pooling.
On the one hand, the class codes C_k^a and C_k^b correspondingly generated from all samples x_a and x_b in the support set S are input to the contrastive learning branch, encoded by the MLP-head into normalized contrast features, the similarity scores between codes of different classes are calculated, and the objective function is optimized through back-propagation. On the other hand, in order to facilitate comparison with other detection methods, only the class codes derived from the x_a samples are taken, and the class codes of the same class are average-pooled to obtain C̄_k. The query-set image features m_Q of the same category and these class codes C̄_k are input to the target locator, where a convolution operation generates the heatmap Ŷ_Q and completes the target detection of the query set image, as shown in equation (12):

Ŷ_Q = m_Q ⊛ C̄_k   (12)
As in the first-stage CenterNet, an L1 loss is used, as shown in equation (13). The average absolute prediction error over Q is minimized by updating the parameters of the class code generator and the target locator, where n is the number of query set images and Z is the ground-truth heatmap:

L_Q = (1/n) Σ_{I∈Q} | Ŷ_Q − Z |   (13)

The total loss of this training phase consists of four parts (Heatmap, Offset, Size and the contrastive loss), as shown in formula (14):

L_meta-det = L_Q + λ_size · L_size + λ_off · L_off + L_CCE   (14)
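Purely as an illustration of the data flow of one meta-task in this second stage, the following PyTorch-style sketch combines the pieces described above; `feature_extractor`, `class_encoder`, `mlp_head`, the augmentations `t1`/`t2` and the single-kernel class code are assumptions (the patent's class code is 3 × 256-dimensional and also parameterizes the size and offset heads), and `supervised_contrastive_loss` refers to the earlier sketch:

```python
import torch
import torch.nn.functional as F

def meta_task_forward(support, query, feature_extractor, class_encoder, mlp_head, t1, t2):
    """Sketch of one meta-task in the second training stage."""
    # --- support branch: two augmented views per image -> positive pairs of class codes
    x_a = torch.stack([t2(img) for img, _ in support])
    x_b = torch.stack([t2(img) for img, _ in support])
    labels = torch.tensor([cls for _, cls in support])
    codes_a = class_encoder(x_a)                         # per-sample class codes, formula (11)
    codes_b = class_encoder(x_b)

    # --- contrastive branch: MLP head -> normalized 128-d features -> L_CCE
    z = F.normalize(mlp_head(torch.cat([codes_a, codes_b], 0)), dim=1)
    loss_cce = supervised_contrastive_loss(z, torch.cat([labels, labels], 0))

    # --- detection branch: average same-class codes (x_a only), convolve with query features
    m_q = feature_extractor(torch.stack([t1(img) for img, _ in query]))   # m_Q, formula (10)
    heatmaps = []
    for c in labels.unique():
        kernel = codes_a[labels == c].mean(0)            # averaged class code, simplified to
        # a single 1x1 heatmap kernel applied to m_Q, in the spirit of formula (12)
        heatmaps.append(F.conv2d(m_q, kernel.view(1, -1, 1, 1)))
    return torch.cat(heatmaps, dim=1), loss_cce
```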
and 4, performing meta-test on the trained small sample target detection model.
After the feature extractor, the class encoder and the target locator are subjected to meta-training, a small number of new class samples with labels are input to the class encoder to generate weight parameters of a specific class, and the target locator is parameterized. And the feature extractor performs feature extraction on the test picture to output a feature image, and then the target locator completes detection of the target in the test image. The new class of test pictures detect possible targets through forward propagation, and the model does not need to be trained or fine-tuned again.
A robust small sample detection model is obtained through the two-stage training, and one or more new classes injected into the model are tested in a manner similar to the second-stage training. First, a support set of new class samples is input to the model and the class-specific encoder extracts the class codes of the new classes; then a new test picture is input to the model and the feature extractor extracts features to obtain a multi-channel feature map; both are input to the target locator, a convolution operation in the manner of formula (3) yields a thermodynamic diagram (heatmap), the peak points within the threshold range are extracted from the heatmap, and the target candidate boxes are obtained according to formula (8), giving the detection result of the test picture.
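As a rough illustration of this meta-test procedure (assumed module names; only the class heatmap head is shown, and offset and size decoding would follow formula (8) as in the decoding sketch above):

```python
import torch
import torch.nn.functional as F

def detect_new_class(test_image, support_images, feature_extractor, class_encoder, t2):
    """Meta-test sketch: inject one labeled new class and detect it without retraining."""
    # class code of the new class from its small labeled support set
    codes = class_encoder(torch.stack([t2(img) for img in support_images]))
    kernel = codes.mean(0)                                   # averaged class code
    # feature map of the test picture, then class-specific convolution -> heatmap
    m = feature_extractor(test_image.unsqueeze(0))
    heatmap = torch.sigmoid(F.conv2d(m, kernel.view(1, -1, 1, 1)))
    # peak extraction and box regression then follow the decoding sketch given earlier
    return heatmap
```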
Under training with a large number of base class images containing rich labels and a small number of labeled few-shot new classes, the detection effect on new-class test pictures is improved, i.e., the mAP and AR scores are increased.
Here AP (average precision) is the average precision and mAP is the mean of the average precision over the m sample classes. Precision is computed over the predictions and indicates how many of the samples predicted to be positive are truly positive samples (i.e., among all samples predicted as positive, the proportion that are actually positive); it is used to evaluate prediction accuracy.
Recall (also known as TPR) is computed over the original samples and indicates how many of the positive examples in the samples are predicted correctly (i.e., the proportion of correctly predicted positives among all actual positive samples); it is used to evaluate how many of the positive samples are found.
As shown in fig. 4, the area under the PR curve is the AP value.
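As a small illustrative computation (not from the original text), the AP can be approximated as the area under the precision-recall curve:

```python
import numpy as np

def average_precision(precisions, recalls):
    """AP as the area under the precision-recall curve (illustrative, no interpolation)."""
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], np.asarray(recalls)[order]))
    p = np.concatenate(([1.0], np.asarray(precisions)[order]))
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))   # step-wise Riemann sum

# mAP is then the mean of the per-class AP values over the m detected classes
```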
Experiments are carried out on the COCO2017 [14] benchmark dataset commonly used for target detection, which contains 118,287 training images and 5,000 validation images covering 80 object categories in total, of which 20 categories are used as new classes. These 20 categories are the same as those covered by the PASCAL VOC2007 [15] dataset, and the remaining 60 categories of the COCO dataset serve as base classes. Experiments are therefore divided into same-dataset evaluation (60 COCO base classes, 20 COCO new classes) and cross-dataset evaluation, in which the PASCAL VOC dataset covering the 20 new classes is used as the test set.
COCO same-dataset evaluation:
The base class training images of the COCO dataset are used for the two stages of model meta-training. First, the base class training images are resized to 512×512 and supervised training is carried out according to the first-stage training procedure. Then, in the second stage, under the limitation of GPU memory, each episode randomly samples 28 meta-tasks; each meta-task covers the detection of 3 categories and each category contains 5 label boxes, and more tasks are beneficial to improving performance. For meta-testing, we randomly sample multiple support sets containing only the 20 new classes from the COCO training set for extracting class codes; each support set is injected into the model once and the model is updated only once, where each new class contains only {k=1, 5, 10} annotation boxes. The performance of our small sample target detection model is evaluated using the new class images of the COCO validation set as test images. We compare our model with several other popular small sample target detection methods: 1) a standard Fine-Tuning detection model based on Faster RCNN; 2) Model-Agnostic Meta-Learning (MAML); 3) non-incremental few-shot object detection via feature reweighting; 4) the incremental small sample object detection network ONCE. The experimental results are shown in table 1. From the results of injecting the trained small sample detection models with the new classes at {k=1, 5, 10}, our method has higher AP values than the other compared methods, and obtains the best AP and AR results both in the base class test and in the mixed test of base and new classes. Meanwhile, in the case of {k=10}, the comparison of the detection results of ONCE and our method is shown in fig. 5; it is easy to see that our method makes fewer misjudgments and detects difficult targets in the image more accurately.
Table 1 COCO same-dataset test results
VOC cross dataset evaluation:
In the two stages of meta-training, the base class data of the COCO dataset are used for training, in a manner completely consistent with the COCO same-dataset evaluation. The difference is that in the meta-test phase, the support sets of multiple new classes are sampled from the training set of the COCO2017 dataset; likewise, each new class contains only {k=5, 10} label boxes, each support set is injected into the model once, and the model is updated only once per injection. The PASCAL VOC2007 test set images are used as test images to evaluate the performance of our small sample target detection model. The experimental comparison results are shown in table 2. In addition, for the test with ten label boxes per new class in each support set, {k=10}, a comparison of the detection effects of ONCE and our method is given in fig. 6. From the results it can be seen that, when five and ten label boxes are sampled for each new class on the COCO training set and tested on the PASCAL VOC test set, the resulting AP and AR values are better than the detection results on the COCO dataset. This illustrates that, in the small sample target detection task, detecting images of the COCO dataset is more challenging than the PASCAL VOC dataset. At the same time, it can be concluded that our framework can easily be migrated to other datasets for detection, which is of great importance for related work on small sample target detection.
Table 2 PASCAL VOC cross-dataset detection comparison results
The advantage of the invention is that the idea of supervised contrastive learning is added to the small sample target detection model by attaching a contrastive learning branch, which improves the performance of the model's class-specific coding and, to a certain extent, alleviates the problem of insufficient generalization of the model to new classes. The training strategy of the invention does not construct a balanced small sample set of base classes and new classes for meta-training; instead, in the meta-test stage, a small number of new class samples never seen by the model are injected directly for detection. This is challenging work, because the newly injected detection categories are easily misjudged as base categories, which severely tests the generalization of the model; however, it is the focus of research in the field of small sample target detection, because with this method new class samples can be easily introduced for effective detection, which is more conducive to future deployment in concrete application scenarios. The experimental part evaluates the method with the main target detection evaluation metrics AP and AR, showing that good detection results are obtained and that the problem of insufficient generalization of existing detection models to new classes is alleviated to a certain extent. Meanwhile, contrastive learning benefits from larger contrast sample sizes and more GPU memory: with more meta-tasks sampled in each episode, the performance could be improved further.
The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the invention, and is not meant to limit the scope of the invention, but to limit the invention to the particular embodiments, and any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (5)

1. An anchor-free incremental target detection method, characterized by comprising the following steps:
step 1, selecting a target detection model;
step 1.1, selecting a CenterNet detection network as a target detection model of a base class network;
step 2, constructing a small sample target detection model based on the target detection model in the step 1;
step 2.1, a CenterNet detection network is regarded as being composed of a feature extractor and a target locator, wherein the feature extractor adopts a ResNet residual network as an encoder, a deconvolution network as a decoder, and all new classes and base classes share weights; the target locator contains convolution kernel weights of each individual class to be detected, analyzes the 3D feature map output by the feature extractor by using class-specific convolution kernels, and generates detection results of input samples in the form of heat-maps;
step 2.2, introducing a class-specific code generator having a class encoder with the same structure as the encoder of the feature extractor, for generating convolution kernel weights C_k and using the generated convolution kernel weights C_k to parameterize the target locator; performing contrastive learning branch training on the class codes generated by the class-specific code generator so as to improve the consistency among class codes of the same class and enlarge the difference between class codes of different classes;
step 3, performing meta-training on the small sample target detection model;
step 3.1, training a class feature extractor on a CenterNet detection network by utilizing a base class data set, wherein the class feature extractor is used for extracting features of new class data;
step 3.2, dividing the labeled base class data set into support set and query set images, inputting the query set into the feature extractor and the support set into the class-specific code generator, extracting features from the base class data set by the feature extractor, and generating class codes related to the base class data set by the class-specific code generator;
step 3.3, performing joint training on the class specific code generator and the target positioner so that the target positioner performs positioning learning of new class data by combining class codes and extracted features;
training the class-specific encoder by adopting an episode meta-learning training strategy:
dividing the whole training data into a plurality of episodes, wherein each episode contains a designated number of meta-tasks, the meta-tasks sample a class label set L from all classes, and according to L a training sample of a support set S and a query set Q is selected for each meta-task for training, wherein S and Q are obtained by randomly distributing the pictures of each class in the label set;
using random horizontal flipping, random cropping and color dithering as the data enhancement T_2 for the support set images, performing data enhancement twice on the image x in each support set to generate samples x_a and x_b as a basic positive sample pair, and inputting them into the class-specific encoder to extract features;
initializing the class-specific encoder with the encoder weights of the trained feature extractor;
in the forward propagation process, for each meta-task the query set images are augmented with T_1, the feature extractor trained in the first stage is used to extract the query-set features, obtaining a feature map of c channels through formula (10); meanwhile, the class-specific encoder processes the support set images augmented by T_2 to generate the corresponding weight parameters C_k, the specific operation being shown in formula (11):

m_Q = f(I),  I ∈ Q   (10)

C_k = g(I_k^S),  I_k^S ∈ S   (11)

wherein m_Q is the feature of query set image I obtained via the feature extractor, g(·) denotes the class-specific encoder, and I_k^S is a support set sample of class k;
outputting a 3 × 256-dimensional class encoding C_k by global average pooling; on the one hand, the class codes C_k^a and C_k^b correspondingly generated from all samples x_a and x_b in the support set S are input to the contrastive learning branch, encoded by the MLP-head into normalized contrast features, the similarity scores between codes of different classes are calculated, and the objective function is optimized through back-propagation; on the other hand, only the class codes derived from the x_a samples are taken, and the class codes of the same class are average-pooled to obtain C̄_k; the query-set image features m_Q of the same category and these class codes C̄_k are input to the target locator, where a convolution operation generates the heatmap Ŷ_Q and completes the target detection of the query set image, as shown in equation (12),

Ŷ_Q = m_Q ⊛ C̄_k   (12)
using an L1 loss, as shown in equation (13), the average absolute prediction error of Q is minimized by updating the parameters of the class code generator and the target locator, where n is the number of query set images and Z is the ground-truth heatmap,

L_Q = (1/n) Σ_{I∈Q} | Ŷ_Q − Z |   (13)

the total loss of the training phase consists of four parts, Heatmap, Offset, Size and the contrastive loss, as shown in equation (14):

L_meta-det = L_Q + λ_size · L_size + λ_off · L_off + L_CCE   (14).
2. the anchor-free incremental target detection method according to claim 1, wherein in step 4, the specific step of performing meta-test on the trained small sample target detection model is as follows:
step 4.1, inputting a small amount of new class data with labels to the trained class-specific encoder to generate weight parameters of specific classes, and parameterizing the target locator;
step 4.2, the class feature extractor performs feature extraction on the new class data and outputs a feature map,
and 4.3, finishing the detection of the target in the test image by the parameterized target positioner.
3. The method for detecting an incremental target based on no anchor according to claim 1, wherein in step 2.2, the specific step of performing the contrast learning branch training on the class code generated by the class-specific code generator is as follows:
step 2.2.1, converting the class code features into 128-dimensional contrast features by using a one-layer multi-layer perceptron;
step 2.2.2, measuring the similarity between codes of different classes on the normalized class features output by the multi-layer perceptron, and optimizing with the loss function of the supervised contrastive learning of class-specific codes to improve the intra-class similarity and inter-class variability of the class codes.
4. The anchor-free incremental target detection method according to claim 3, wherein in step 2.2.2 it is assumed that there are two class encodings X_i and X_j with the same label, and the loss function of the supervised contrastive learning of class-specific codes is:

L_CCE = (1/N) · Σ_{i=1}^{N} L_{z_i}   (1)

L_{z_i} = −1/(N_{y_i} − 1) · Σ_{j=1, j≠i, y_j=y_i}^{N} log [ exp(z_i · z_j / τ) / Σ_{k=1}^{N} 1_{k≠i} · exp(z_i · z_k / τ) ]   (2)

in formula (1), L_{z_i} is the loss function value calculated for a single sample, L_CCE is the average loss function value of a meta-task, N is the number of samples, y_i is the label of class encoding X_i and y_j is the label of class encoding X_j; in formula (2), N_{y_i} represents the number of samples belonging to the same class in a meta-task, 1_{k≠i} is the indicator function, which takes 0 if and only if k = i and 1 otherwise, τ is the temperature parameter to be tuned, and z_i is the 128-dimensional normalized class feature obtained by processing the class encoding X_i with the multi-layer perceptron; the numerator in the log function is the representation distance between X_i and X_j, and the denominator is the representation distance between X_i and all data, including positive and negative samples, in each meta-task.
5. The anchor-free incremental target detection method according to claim 1, characterized in that in step 2.2, the class-specific code generator outputs the class-related convolution kernel weight C_k by means of global average pooling to further parameterize the weight parameters of the target locator.
CN202111153974.XA 2021-09-29 2021-09-29 Anchor-free incremental target detection method Active CN113822368B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111153974.XA CN113822368B (en) 2021-09-29 2021-09-29 Anchor-free incremental target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111153974.XA CN113822368B (en) 2021-09-29 2021-09-29 Anchor-free incremental target detection method

Publications (2)

Publication Number Publication Date
CN113822368A CN113822368A (en) 2021-12-21
CN113822368B true CN113822368B (en) 2023-06-20

Family

ID=78921753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111153974.XA Active CN113822368B (en) 2021-09-29 2021-09-29 Anchor-free incremental target detection method

Country Status (1)

Country Link
CN (1) CN113822368B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114663707A (en) * 2022-03-28 2022-06-24 中国科学院光电技术研究所 Improved few-sample target detection method based on fast RCNN
CN115880266B (en) * 2022-12-27 2023-08-01 深圳市大数据研究院 Intestinal polyp detection system and method based on deep learning
CN116363085B (en) * 2023-03-21 2024-01-12 江苏共知自动化科技有限公司 Industrial part target detection method based on small sample learning and virtual synthesized data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329827A (en) * 2020-10-26 2021-02-05 同济大学 Increment small sample target detection method based on meta-learning
CN112861720A (en) * 2021-02-08 2021-05-28 西北工业大学 Remote sensing image small sample target detection method based on prototype convolutional neural network

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7167123B2 (en) * 1999-05-25 2007-01-23 Safe Zone Systems, Inc. Object detection method and apparatus
CN110969205A (en) * 2019-11-29 2020-04-07 南京恩博科技有限公司 Forest smoke and fire detection method based on target detection, storage medium and equipment
WO2021154624A1 (en) * 2020-01-27 2021-08-05 Matthew Charles King System and method for performing machine vision recognition of dynamic objects
CN112819110B (en) * 2021-04-19 2021-06-29 中国科学院自动化研究所 Incremental small sample target detection method and system based on weight generation
CN113379718B (en) * 2021-06-28 2024-02-02 北京百度网讯科技有限公司 Target detection method, target detection device, electronic equipment and readable storage medium
CN113361645B (en) * 2021-07-03 2024-01-23 上海理想信息产业(集团)有限公司 Target detection model construction method and system based on meta learning and knowledge memory
CN113392855A (en) * 2021-07-12 2021-09-14 昆明理工大学 Small sample target detection method based on attention and comparative learning
CN113393457B (en) * 2021-07-14 2023-02-28 长沙理工大学 Anchor-frame-free target detection method combining residual error dense block and position attention

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329827A (en) * 2020-10-26 2021-02-05 同济大学 Increment small sample target detection method based on meta-learning
CN112861720A (en) * 2021-02-08 2021-05-28 西北工业大学 Remote sensing image small sample target detection method based on prototype convolutional neural network

Also Published As

Publication number Publication date
CN113822368A (en) 2021-12-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant