CN113822368B - Anchor-free incremental target detection method - Google Patents

Anchor-free incremental target detection method

Info

Publication number
CN113822368B
CN113822368B (application number CN202111153974.XA)
Authority
CN
China
Prior art keywords
class
training
target
specific
meta
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111153974.XA
Other languages
Chinese (zh)
Other versions
CN113822368A (en)
Inventor
符颖
林弟忠
胡金蓉
文武
邹书蓉
周激流
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology filed Critical Chengdu University of Information Technology
Priority to CN202111153974.XA priority Critical patent/CN113822368B/en
Publication of CN113822368A publication Critical patent/CN113822368A/en
Application granted granted Critical
Publication of CN113822368B publication Critical patent/CN113822368B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of image recognition, and particularly discloses an anchor-free incremental target detection method, which comprises the following steps: step 1, selecting a target detection model; step 2, constructing a small sample target detection model based on the target detection model of step 1; step 3, performing meta-training on the small sample target detection model; and step 4, performing meta-testing on the trained small sample target detection model. According to the invention, under training with a large amount of base class data (images) containing rich labels and a small number of labeled few-shot samples, the detection effect on new-class test pictures is improved, i.e., the mAP and AR scores are increased.

Description

Anchor-free incremental target detection method
Technical Field
The invention relates to the field of image recognition, in particular to an incremental target detection method based on no anchor.
Background
Against the background of the gradual maturation of high-performance parallel computing technology and the rapid development of neural networks, target detection based on deep learning has rapidly replaced feature extraction methods based on manually designed features. The main task of target detection is to locate objects of interest in an input image and then accurately determine the class of each object of interest. Mature target detection algorithms have been successfully deployed in practical application scenarios such as video surveillance, autonomous driving and traffic scene detection by relying on large-scale annotated data, but a large amount of labeled data is required. Affected by an insufficient amount of annotated data, the practical application scenarios are not wide enough and the detection tasks that can be performed are limited. Labeling data is a very expensive and laborious task, and it is impractical to obtain large-scale annotated data in most practical application scenarios, which greatly limits the application of existing target detection algorithms to more practical scenes.
Based on this, how to learn a target detection model with a certain generalization capability from very little labeled data is a problem that urgently needs further study. In recent years, a large number of researchers have focused on target detection under scenarios with only a small amount of annotated data, which is called small sample (few-shot) target detection. Most current research models are based on traditional target detection frameworks such as Faster-RCNN, YOLO and SSD and draw on the meta-training strategy of few-shot learning: after training on a large number of base classes, a small number of labeled samples of new classes are injected into the detection model, and classification and regression of the new classes are completed, which is very challenging.
In 2019, Fan et al. introduced an attention-RPN module into the candidate-box region extraction network to fuse the features of the query image and the support set images, proposed a multi-relation detector for learning feature relations at the local, global and cross-correlation levels, and adopted a two-way contrastive training strategy to perform similarity matching for detecting new classes. In 2020, Wang et al. [6] proposed a two-stage training scheme with Faster-RCNN as the framework, fine-tuning only the classification and regression sub-networks in the second stage and adapting to novel classes by readjusting the combination weights of the features. In 2020, Juan-Manuel et al. [7] proposed to build on the CenterNet framework, introducing a feature extractor for image feature extraction, a target locator for target localization, and a ResNet-50 network to extract the class-specific weights from the images of each category, using these weights to complete the detection of the new categories. In 2021, Bo Sun et al. [9] built on the work of Wang et al., introducing the ideas of a feature pyramid model and contrastive learning, improving detection performance by 2.7% on the COCO benchmark dataset and by 8.8% on the standard PASCAL VOC dataset. This kind of detection method performs detection directly in the conventional inference manner, can easily introduce new classes, is very efficient, has low data requirements for the new classes, and has obvious performance advantages over existing methods. However, the ONCE network proposed by Juan-Manuel et al. on the CenterNet target detection framework uses only a ResNet-50 network to extract the class-specific codes, which, judging from the experimental results, is not an optimal scheme: its feature extraction capability for new classes is insufficient.
Disclosure of Invention
In order to solve the above problems, the invention provides an anchor-free incremental target detection method, which can better extract class-specific feature information and optimize the detection results.
The invention is realized by the following technical scheme:
An anchor-free incremental target detection method comprises the following steps:
step 1, selecting a target detection model;
step 2, constructing a small sample target detection model based on the target detection model in the step 1;
step 3, performing meta-training on the small sample target detection model;
and 4, performing meta-test on the trained small sample target detection model.
As optimization, in step 1, the specific steps of constructing the target detection model include:
step 1.1, selecting a CenterNet detection network as a target detection model of a base class network.
In the step 2, as optimization, the specific steps of constructing a small sample target detection model are as follows:
step 2.1, a CenterNet detection network is regarded as being composed of a feature extractor and a target locator, wherein the feature extractor adopts a ResNet residual network as an encoder, a deconvolution network as a decoder, and all new classes and base classes share weights; the target locator contains convolution kernel weights of each individual class to be detected, analyzes the 3D feature map output by the feature extractor by using class-specific convolution kernels, and generates detection results of input samples in the form of heat-maps;
step 2.2, introducing a class-specific code generator having a class encoder with the same structure as the encoder of the feature extractor, for generating convolution kernel weights C_k and using the generated convolution kernel weights C_k to parameterize the target locator; and performing contrastive learning branch training on the class codes generated by the class-specific code generator, so as to improve the consistency among class codes of the same class and enlarge the diversity between class codes of different classes.
As an optimization, in step 3, the specific steps of performing meta-training on the small sample target detection model are as follows:
step 3.1, training a class feature extractor on a CenterNet detection network by utilizing a base class data set, wherein the class feature extractor is used for extracting features of new class data;
step 3.2, dividing the labeled base class data set into support set and query set images, inputting the query set into the feature extractor and the support set into the class-specific code generator, extracting features from the base class data set by the feature extractor, and generating class codes related to the base class data set by the class-specific code generator;
and 3.3, performing joint training on the class specific code generator and the target localizer, so that the target localizer performs positioning learning of new class data by combining the class codes and the extracted features.
In step 4, as optimization, the specific steps of performing meta-test on the trained small sample target detection model are as follows:
step 4.1, inputting a small amount of new class data with labels to the trained class-specific encoder to generate weight parameters of specific classes, and parameterizing the target locator;
step 4.2, the class feature extractor performs feature extraction on the new class data and outputs a feature map,
and 4.3, finishing the detection of the target in the test image by the parameterized target positioner.
As optimization, in step 2.2, the specific steps of performing contrast learning branch training on the class code generated by the class specific code generator are as follows:
step 2.2.1, converting the class code features into 128-dimensional contrast features by using a one-layer multi-layer perceptron;
step 2.2.2, measuring the similarity between codes of different classes on the normalized class features output by the multi-layer perceptron, and optimizing with the loss function of the supervised contrastive learning of class-specific codes to improve the intra-class similarity and inter-class variability of the class codes.
As an optimization, in step 2.2.2, it is assumed that there are two class encodings X_i and X_j with the same label, and the loss function of the supervised contrastive learning of class-specific codes is:

L_CCE = (1/N) · Σ_{i=1}^{N} L_{z_i}   (1)

L_{z_i} = −1/(N_{y_i} − 1) · Σ_{j=1, j≠i, y_j=y_i}^{N} log [ exp(z_i · z_j / τ) / Σ_{k=1}^{N} 1_{k≠i} · exp(z_i · z_k / τ) ]   (2)

In formula (1), L_{z_i} is the loss function value calculated for a single sample and L_CCE is the average loss function value of a meta-task. In formula (2), N_{y_i} represents the number of samples belonging to the same class in a meta-task, 1_{k≠i} is the indicator function, which takes 0 if and only if k = i and 1 otherwise, τ is the temperature parameter to be tuned, and z_i is the 128-dimensional normalized class feature obtained by processing the class encoding X_i with the multi-layer perceptron; the numerator in the log function is the representation distance between X_i and X_j, and the denominator is the representation distance between X_i and all data (positive and negative samples) in each meta-task.
As an optimization, in step 2.2, the class-specific code generator outputs a class-related convolution kernel weight C_k by means of global average pooling to further parameterize the weight parameters of the target locator.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. according to the invention, under training with a large amount of base class data (images) containing rich labels and a small number of labeled few-shot samples, the detection effect on new-class test pictures is improved, i.e., the mAP and AR scores are increased.
2. The method of the invention has fewer misjudgments and can more accurately detect difficult targets in the image.
3. The invention can be easily migrated to other data sets for detection, which has important significance for the related work of small sample target detection.
Drawings
In order to more clearly illustrate the technical solutions of the exemplary embodiments of the present invention, the drawings that are needed in the examples will be briefly described below, it being understood that the following drawings only illustrate some examples of the present invention and therefore should not be considered as limiting the scope, and that other related drawings may be obtained from these drawings without inventive effort for a person skilled in the art. In the drawings:
fig. 1 is a schematic diagram of the network structure of CenterNet in the anchor-free incremental target detection method according to the present invention;
fig. 2 is a schematic diagram of a network structure of a small sample target detection model in an incremental target detection method based on no anchor according to the present invention.
Fig. 3 is a schematic diagram of class code contrast learning branches in an incremental target detection method based on no anchor according to the present invention.
Fig. 4 is a schematic drawing of Precision and Recall curves.
FIGS. 5 and 6 are graphs comparing the effects of ONCE (upper) and the anchor-free incremental target detection method according to the invention.
Detailed Description
For the purpose of making apparent the objects, technical solutions and advantages of the present invention, the present invention will be further described in detail with reference to the following examples and the accompanying drawings, wherein the exemplary embodiments of the present invention and the descriptions thereof are for illustrating the present invention only and are not to be construed as limiting the present invention.
Research shows that, when a small amount of new class data (pictures) is injected into a small sample target detection model, training without mixing in base class data (pictures) causes great interference to the model: the new class data are easily misjudged as the already-trained base classes, which seriously affects the detection results. However, injecting never-seen new data samples into a model trained on a large number of base classes for detection is one of the important tasks in the field of small sample target detection, because new data samples with only a small amount of labeled data can then be easily injected for detection.
In this study the dataset is divided into base classes and new classes with no class intersection at all, where the base class samples with rich label information serve as feature guidance and a small number of labeled new classes are used to test model performance. After being trained on the base classes, the constructed model can be effectively transferred to unseen new class samples for detection, without constructing a balanced small sample set of base and new classes for retraining. Such a detection approach is very challenging, but it allows new sample classes to be injected easily for detection. For the construction of new-class small sample sets, see Xin Wang, Thomas E. Huang, Trevor Darrell, Joseph E. Gonzalez, and Fisher Yu. Frustratingly Simple Few-Shot Object Detection. In International Conference on Machine Learning (ICML), July 2020.
according to the incremental target detection method based on the no-anchor, target detection is divided into two types of anchor-free frames and anchor-free frames, wherein the anchor frames are predefined by a network, and complex calculation is carried out. The anchor-free frame is a detection target frame directly generated in the later stage of the network, and the incremental type is mentioned in the application, so that new data can be easily introduced, and the incremental type is directly added. Namely, a small sample target detection method based on contrast learning class specific coding comprises the following steps: and 1, selecting a target detection model. In order to construct an effective detection network that can be adapted to detect new class samples, it is first necessary to select an appropriate base class detection network, the selection not being arbitrary, since efficient and rapid detection of a small number of new class samples needs to be accomplished. In recent years, fast-RCNN is often used as a base class detection network, but it adopts a two-stage design and classifies based on softmax, and interaction between classes makes detecting independent new classes inflexible. Therefore, the target detection model selects a central Net detection network of an Anchor-free target detection algorithm capable of carrying out rapid and efficient detection as a base class network, and compared with networks such as YOLO, SSD and the like, the target detection model can achieve a trade-off in detection speed and precision and is easier to construct a specific class of characterization extraction module.
The network design of CenterNet is derived from the idea of keypoint detection: keypoint estimation is used to find a center point, target attributes such as the height and width of the target bounding box are obtained by regression, and no post-processing such as keypoint grouping or non-maximum suppression is needed; the network structure is shown in fig. 1. CenterNet feeds the training images into a fully convolutional encoding-decoding network to obtain a multi-channel heatmap, where each category corresponds to one heatmap. The peak points of the 2D heatmap are the center points, and the height and width of the targets are also predicted at the peak point positions. The specific detection details are described later. This keypoint-based target detection framework not only eliminates region candidate boxes, but can also generate a separate prediction heatmap for each class and perform independent detection through an activation threshold. Such a detection framework is very suitable for injecting new classes in small sample target detection, since the base classes and the new classes do not interfere with each other.
And 2, constructing a small sample target detection model based on the target detection model in the step 1.
In order to make the detection model output corresponding weight parameters for each class of images and extract effective features of the new classes independently, the target detection model, i.e. the CenterNet network structure, is adjusted. Instead of training and obtaining the weight parameters in the original end-to-end manner, a meta-learning training strategy is introduced and a new small sample target detection model is designed. The network structure is shown in fig. 2.
We regard the CenterNet network as consisting of a feature extractor and a target locator, where the feature extractor uses a ResNet residual network as the encoder and a deconvolution network as the decoder, and all new classes and base classes share its weights. The target locator contains the specific weight parameters of each individual class to be detected, uses the class-specific convolution kernels to analyze the 3D feature map output by the feature extractor, and generates the detection results of the input samples in the form of a heatmap.
Then, a class-specific code generator is introduced to generate the convolution kernel weights in the target locator, replacing the original way of updating the weights through iterative training. The class-specific code generator adopts the same encoder structure as the feature extraction network but is not connected to the deconvolution network; it outputs the class-related weight parameters (convolution kernel weights) C_k by global average pooling, which further parameterize the target locator, i.e. the weight parameters C_k produced by the class-specific code generator network compose the parameters of the target locator network (such as the class-related convolution kernel weight parameters). In addition, a contrastive learning branch is connected behind the class-specific encoder to guide the class-specific code generator to learn contrast-aware class-specific codes, compute the feature similarity between class-specific codes, and better model intra-class similarity and inter-class difference features.
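By way of illustration only, the overall structure described above might be sketched as follows; the module layout, channel sizes and the use of a single 256-dimensional code per class are simplifying assumptions and not the patent's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class FeatureExtractor(nn.Module):
    """ResNet encoder + deconvolution decoder, shared by all base and new classes."""
    def __init__(self, out_channels=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])   # (B, 2048, h/32, w/32)
        self.decoder = nn.Sequential(                                  # deconvolve back to h/R, w/R (R = 4)
            nn.ConvTranspose2d(2048, 512, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, out_channels, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.decoder(self.encoder(x))

class ClassSpecificCodeGenerator(nn.Module):
    """Same encoder structure as the feature extractor, without the deconvolution network;
    the globally average-pooled output acts as a class-specific convolution kernel."""
    def __init__(self, code_dim=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])
        self.project = nn.Conv2d(2048, code_dim, kernel_size=1)
    def forward(self, support_images):
        feat = self.project(self.encoder(support_images))
        return F.adaptive_avg_pool2d(feat, 1).flatten(1)               # (N, code_dim) class codes C_k

def locate(feature_map, class_code):
    """Target locator: the generated class code parameterizes a class-specific 1x1 convolution."""
    kernel = class_code.view(1, -1, 1, 1)                              # 1x1 conv kernel from C_k
    return torch.sigmoid(F.conv2d(feature_map, kernel))                # per-class heatmap
```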
Specifically, the steps of the contrastive learning branch training applied to the class codes generated by the class-specific code generator are as follows:
step 2.2.1, converting the class code features into 128-dimensional contrast features by using a one-layer multi-layer perceptron head (1-layer multi-layer-perceptron (MLP) head);
step 2.2.2, measuring the similarity between codes of different classes on the normalized class features output by the multi-layer perceptron, and optimizing with the loss function of the supervised contrastive learning of class-specific codes to improve the intra-class similarity and inter-class variability of the class codes.
The loss function of the supervised contrastive learning of class-specific codes is formulated drawing on related work in self-supervised and supervised contrastive learning.
In the supervised scenario, the features of class codes of the same class form positive sample pairs, and the features of class codes of different classes form negative sample pairs. To pull together class codes X_i and X_j with identical labels and push apart class codes with different labels, the loss function of the supervised contrastive learning of class-specific codes is designed as shown in formulas (1) and (2). The similarity between class-code features is measured by means of the inner product. In training, multiple meta-tasks are sampled, and the support set portion of each meta-task samples N samples from the given data, denoted as {x_k, y_k}, k = 1, 2, ..., N, where y_k is the label of x_k.
L_CCE = (1/N) · Σ_{i=1}^{N} L_{z_i}   (1)

L_{z_i} = −1/(N_{y_i} − 1) · Σ_{j=1, j≠i, y_j=y_i}^{N} log [ exp(z_i · z_j / τ) / Σ_{k=1}^{N} 1_{k≠i} · exp(z_i · z_k / τ) ]   (2)

In formula (1), L_{z_i} is the loss calculated for a single sample and L_CCE is the average loss of one meta-task. In formula (2), N_{y_i} represents the number of samples belonging to the same class in a meta-task, 1_{k≠i} is the indicator function, which takes 0 if and only if k = i and 1 otherwise, and τ is the temperature parameter to be tuned [12]. z_i is the 128-dimensional normalized class feature obtained by processing the class encoding X_i with the multi-layer perceptron (MLP-Head). In the log function, the numerator is the representation distance between X_i and X_j, and the denominator is the representation distance between X_i and all data (positive and negative samples) in each meta-task. This design well characterizes intra-class similarity and inter-class difference, improves the coding performance of the class-specific encoder, and its effectiveness is demonstrated in the later experiments.
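For illustration, a minimal PyTorch-style sketch of this supervised contrastive loss over the MLP-projected class codes could look as follows (tensor shapes, the temperature default and variable names are assumptions):

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z, labels, tau=0.07):
    """Supervised contrastive loss over normalized class-code features.

    z:      (N, 128) MLP-head outputs for the class codes in one meta-task
    labels: (N,)     class labels y_i of the corresponding support samples
    tau:    temperature parameter
    """
    z = F.normalize(z, dim=1)                       # normalized contrast features
    sim = torch.matmul(z, z.t()) / tau              # pairwise inner-product similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    # denominator of formula (2): all samples except i itself (positives and negatives)
    exp_sim = torch.exp(sim).masked_fill(self_mask, 0.0)
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-12)
    # positives: samples with the same label, excluding i itself
    pos_mask = ((labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask).float()
    n_pos = pos_mask.sum(dim=1).clamp(min=1)        # N_{y_i} - 1
    loss_per_sample = -(log_prob * pos_mask).sum(dim=1) / n_pos   # L_{z_i}
    return loss_per_sample.mean()                                 # L_CCE, formula (1)
```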
And step 3, performing meta-training on the small sample target detection model.
Drawing on the training strategy of meta-learning, and in order to make full use of the base categories, meta-training is divided into two serial stages: in the first stage, a class feature extractor is trained on the CenterNet detection network structure with a large amount of base class data, and is used to extract the features of the small amount of new class data; the second stage of training is divided into multiple episodes (the term episode originates from Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching Networks for One Shot Learning. NeurIPS, 2016; see also T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal Loss for Dense Object Detection. ICCV, 2017). Each episode divides the labeled training data into query set and support set images, which are input into the feature extractor and the class-specific encoder of the small sample target detection model, respectively, using a strategy of jointly training the class-specific encoder and the target locator, wherein the target locator performs the localization learning of the small sample targets by combining the class codes and the extracted features.
In this training method, each episode samples a plurality of meta-tasks containing different category combinations. This mechanism enables the model to learn the parts that are common across different meta-tasks, such as how to extract important features and compare the similarity of samples, and to forget the task-specific parts of each meta-task. A model trained with this learning mechanism can classify better when facing new, unseen meta-tasks, which is more beneficial for learning class codes.
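As an illustrative sketch of this episodic sampling (dataset structure, class counts and shot numbers are assumed, not prescribed by the text):

```python
import random

def sample_meta_task(images_by_class, n_way=3, k_support=5, k_query=5):
    """Sample one meta-task: a class label set L plus a support/query split.

    images_by_class: dict mapping class label -> list of annotated images
    """
    label_set = random.sample(list(images_by_class), n_way)       # e.g. {banana, umbrella, ...}
    support, query = [], []
    for cls in label_set:
        imgs = random.sample(images_by_class[cls], k_support + k_query)
        support += [(img, cls) for img in imgs[:k_support]]        # -> class-specific encoder
        query += [(img, cls) for img in imgs[k_support:]]          # -> feature extractor
    return support, query

def sample_episode(images_by_class, n_tasks=28):
    """An episode is a fixed number of meta-tasks with different class combinations."""
    return [sample_meta_task(images_by_class) for _ in range(n_tasks)]
```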
First stage training: learning of feature extractors
On the base class dataset, random flipping, random scaling, cropping and color dithering are first used as the data enhancement T_1, and standard supervised training is carried out in a manner similar to CenterNet. Given a training image I ∈ R^{h×w×3} of height h and width w, it is input to a feature extractor f(·) based on a ResNet residual network to extract a feature map m = f(I), with m ∈ R^{(h/R)×(w/R)×c}, where R is the downsampling factor and m comprises c channels corresponding to the c categories of target objects. Then, in the target locator h, the convolution kernels c_k ∈ R^{1×1×c} of the base classes are learned and convolved with the feature map m to obtain the heatmap Ŷ of each class, where Ŷ_{xyc} = 1 represents a detected keypoint, i.e. a class-c object is detected at the (x, y) coordinates, while Ŷ_{xyc} = 0 indicates that the current coordinate point contains no class-c object and is regarded as background. As shown in formula (3), Ŷ has c + 4 channels: the c channels correspond to the predictions of the c categories, and the remaining four channels correspond to the prediction of the bounding-box center-point offset and the prediction of the bounding-box width and height, respectively. ⊛ denotes the convolution operation and K_b is the number of base class samples.

Ŷ = m ⊛ {c_k},  k = 1, 2, ..., K_b   (3)
In training, the ground-truth keypoints p ∈ R² are converted to the downsampled coordinates p̃ = ⌊p/R⌋ used for training. On the c channel feature maps obtained by downsampling, the individual ground-truth keypoints are splatted onto the heatmap Y ∈ [0, 1]^{(h/R)×(w/R)×c} through a Gaussian kernel Y_{xyc}. The calculation of Y_{xyc} is shown in formula (4), where σ_p is an adaptive standard deviation associated with the target's width and height. If the Gaussian distributions generated by different objects of the same class overlap, the larger Gaussian value is taken within the overlap.

Y_{xyc} = exp(−((x − p̃_x)² + (y − p̃_y)²) / (2σ_p²))   (4)
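By way of illustration, the splatting of ground-truth keypoints with this Gaussian kernel can be sketched as follows (a NumPy sketch with assumed array shapes; the 3σ truncation radius is an implementation assumption, not specified in the text):

```python
import numpy as np

def draw_gaussian(heatmap, center, sigma):
    """Splat one ground-truth keypoint onto a single-class heatmap channel.

    heatmap: (H, W) array for one class, values in [0, 1]
    center:  (cx, cy) downsampled ground-truth keypoint coordinates
    sigma:   adaptive standard deviation derived from the object's size
    """
    h, w = heatmap.shape
    cx, cy = int(center[0]), int(center[1])
    radius = int(3 * sigma)
    x0, x1 = max(0, cx - radius), min(w, cx + radius + 1)
    y0, y1 = max(0, cy - radius), min(h, cy + radius + 1)
    ys, xs = np.ogrid[y0:y1, x0:x1]
    gaussian = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    # if Gaussians of same-class objects overlap, keep the larger value
    np.maximum(heatmap[y0:y1, x0:x1], gaussian, out=heatmap[y0:y1, x0:x1])
    return heatmap
```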
The keypoint heatmap is trained with a loss function rewritten from the Focal loss, as shown in equation (5):

L_k = −(1/N) Σ_{xyc} { (1 − Ŷ_{xyc})^α · log(Ŷ_{xyc}),                    if Y_{xyc} = 1
                       (1 − Y_{xyc})^β · (Ŷ_{xyc})^α · log(1 − Ŷ_{xyc}),  otherwise }   (5)

where α and β are the hyperparameters of the Focal loss (α = 2 and β = 4 were used in the experiments) and N is the number of keypoints in image I; see [13] for more details.
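A minimal sketch of the penalty-reduced focal loss in equation (5), assuming PyTorch tensors of shape (B, C, H, W) for the predicted and ground-truth heatmaps (names and the clamping constant are assumptions):

```python
import torch

def centernet_focal_loss(pred, gt, alpha=2, beta=4):
    """Pixel-wise focal loss on the predicted heatmap, as in equation (5).

    pred: (B, C, H, W) predicted heatmap, values in (0, 1)
    gt:   (B, C, H, W) ground-truth Gaussian heatmap, peaks equal 1
    """
    pos_mask = gt.eq(1).float()
    neg_mask = gt.lt(1).float()
    pred = pred.clamp(1e-6, 1 - 1e-6)          # numerical stability
    pos_loss = torch.log(pred) * (1 - pred) ** alpha * pos_mask
    neg_loss = torch.log(1 - pred) * pred ** alpha * (1 - gt) ** beta * neg_mask
    num_pos = pos_mask.sum().clamp(min=1)      # N: number of keypoints
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```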
Mapping the downsampled feature map back onto the original image introduces a precision deviation, so one offset needs to be predicted for each center point, i.e. Ô ∈ R^{(h/R)×(w/R)×2}. The center points of all classes c share the same offset prediction, which is trained with the L1 loss as shown in equation (6):

L_off = (1/N) Σ_p | Ô_{p̃} − (p/R − p̃) |   (6)

where Ô_{p̃} is the predicted offset and (p/R − p̃) is a value calculated in advance during the training process.
Assume that (x1^(k), y1^(k), x2^(k), y2^(k)) is the bounding box of object k with category c_k and center point p_k = ((x1^(k) + x2^(k))/2, (y1^(k) + y2^(k))/2). The size of each target k is regressed to s_k = (x2^(k) − x1^(k), y2^(k) − y1^(k)), where s_k is likewise calculated in advance as the ground-truth width and height after downsampling. The keypoint prediction network Ŷ is used to predict all center points, and to reduce the burden of the regression calculation a single size prediction Ŝ ∈ R^{(h/R)×(w/R)×2} is used for every point in the heatmap, also trained with the L1 regression loss, as shown in formula (7):

L_size = (1/N) Σ_{k=1}^{N} | Ŝ_{p̃_k} − s_k |   (7)
during prediction, the local peak point of each c category on the hetmap is extracted respectively
Figure BDA00032879556200000812
These peaks are activation values greater than or equal to Y k The 8 neighborhood points in the model are probability values of the c-class target center points, and the first 100 points are reserved in a 3X3 maximum pooling mode. Calling out center points greater than the threshold value from the selected 100 peak points according to the set threshold value>
Figure BDA00032879556200000813
As a final result. And use +.>
Figure BDA00032879556200000814
And (5) regressing the target frame as a confidence index of the current point. The predicted target frame is shown in formula (8):
Figure BDA00032879556200000815
wherein the method comprises the steps of
Figure BDA00032879556200000816
Is a prediction of the center point offset. />
Figure BDA00032879556200000817
Wherein->
Figure BDA00032879556200000818
Is the size prediction of the bounding box. Both offset and bounding box use L 1 The regression loss function is trained as shown in equation (6) and equation (7).
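For illustration, the prediction-time decoding described above (3×3 max pooling as peak extraction, top-100 points, thresholding and box assembly per formula (8)) could be sketched as follows; the tensor layouts, the joint top-k over all classes, and the scaling back to the input image are assumptions rather than the patent's exact procedure:

```python
import torch
import torch.nn.functional as F

def decode_detections(heatmap, offset, size, k=100, threshold=0.3):
    """Decode center-point detections from the predicted heads.

    heatmap: (C, H, W) class heatmaps after sigmoid
    offset:  (2, H, W) center-point offset predictions
    size:    (2, H, W) bounding-box width/height predictions
    """
    # keep only local peaks: a point survives if it equals its 3x3 neighborhood max
    pooled = F.max_pool2d(heatmap.unsqueeze(0), 3, stride=1, padding=1).squeeze(0)
    heatmap = heatmap * (pooled == heatmap).float()

    c, h, w = heatmap.shape
    scores, idx = heatmap.view(-1).topk(k)        # top-k peaks (taken jointly over classes here)
    classes = idx // (h * w)
    ys = (idx % (h * w)) // w
    xs = idx % w

    boxes = []
    for score, cls, y, x in zip(scores, classes, ys, xs):
        if score < threshold:                      # keep only points above the set threshold
            continue
        dx, dy = offset[0, y, x].item(), offset[1, y, x].item()
        bw, bh = size[0, y, x].item(), size[1, y, x].item()
        cx, cy = x.item() + dx, y.item() + dy
        # box in heatmap coordinates per formula (8); multiply by the downsampling
        # factor R to map back to the input image if required
        boxes.append((cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2,
                      score.item(), int(cls.item())))
    return boxes
```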
The total training loss consists of three parts (Heatmap, Offset and Size) and is used to optimize the parameters of the feature extractor f and the convolution kernel parameters c_k of the locator. As shown in formula (9), λ_size = 0.1 and λ_off = 1 were set in the experiments.

L_det = L_k + λ_size · L_size + λ_off · L_off   (9)
The goal of this stage is only to learn a feature extractor, while the target locator is a conventional CenterNet locator, which is discarded in the second stage; only the trained feature extractor f(·) is kept.
Training in the second stage: class-specific encoder learning
The class-related convolution kernel parameters obtained by training in the previous stage are only the fixed parameters of the base classes. In the second stage, the parameters of the feature extractor are frozen, and the class-specific encoder connected with the contrastive learning branch is mainly trained, so that it can efficiently encode a new class from a small number of labeled samples. To train the class-specific encoder efficiently, we use the episode meta-learning training strategy [13].
The specific method is as follows: the whole training data is divided into a number of episodes, and each episode contains a specified number of meta-tasks. Each meta-task samples a class label set L from all classes (for example, L = {banana, umbrella, ...}), and according to L a support set S and a query set Q of training samples are chosen for each meta-task, where S and Q are obtained by randomly distributing the pictures of each class in the label set. Random horizontal flipping, random cropping and color dithering are used as the data enhancement T_2 for the support set images; the image x in each support set is augmented twice to generate samples x_a and x_b, which form a basic positive sample pair and are input to the class-specific encoder to extract features. The class-specific encoder is initialized with the encoder weights of the feature extractor trained in the first stage. During forward propagation, for each meta-task the query set images are augmented with T_1, the feature extractor trained in the first stage extracts the query-set features, and a feature map of c channels is obtained according to formula (10). Meanwhile, the class-specific encoder processes the support set images augmented by T_2 to generate the corresponding weight parameters C_k, as shown in formula (11):

m_Q = f(I),  I ∈ Q   (10)

C_k = g(I_k^S),  I_k^S ∈ S   (11)

where m_Q is the feature of query set image I obtained via the feature extractor, g(·) denotes the class-specific encoder and I_k^S is a support set sample of class k. Finally, a 3 × 256-dimensional class code C_k is output by global average pooling.
On the one hand, the class codes C_k^a and C_k^b correspondingly generated from all samples x_a and x_b in the support set S are input to the contrastive learning branch, encoded by the MLP-head into normalized contrast features, the similarity scores between codes of different classes are calculated, and the objective function is optimized through back-propagation. On the other hand, in order to facilitate comparison with other detection methods, only the class codes derived from the x_a samples are taken, and the class codes of the same class are average-pooled to obtain C̄_k. The query-set image features m_Q of the same category and these class codes C̄_k are input to the target locator, where a convolution operation generates the heatmap Ŷ_Q and completes the target detection of the query set image, as shown in equation (12):

Ŷ_Q = m_Q ⊛ C̄_k   (12)
As in the first-stage CenterNet, an L1 loss is used, as shown in equation (13). The average absolute prediction error over Q is minimized by updating the parameters of the class code generator and the target locator, where n is the number of query set images and Z is the ground-truth heatmap:

L_Q = (1/n) Σ_{I∈Q} | Ŷ_Q − Z |   (13)

The total loss of this training phase consists of four parts (Heatmap, Offset, Size and the contrastive loss), as shown in formula (14):

L_meta-det = L_Q + λ_size · L_size + λ_off · L_off + L_CCE   (14)
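Purely as an illustration of the data flow of one meta-task in this second stage, the following PyTorch-style sketch combines the pieces described above; `feature_extractor`, `class_encoder`, `mlp_head`, the augmentations `t1`/`t2` and the single-kernel class code are assumptions (the patent's class code is 3 × 256-dimensional and also parameterizes the size and offset heads), and `supervised_contrastive_loss` refers to the earlier sketch:

```python
import torch
import torch.nn.functional as F

def meta_task_forward(support, query, feature_extractor, class_encoder, mlp_head, t1, t2):
    """Sketch of one meta-task in the second training stage."""
    # --- support branch: two augmented views per image -> positive pairs of class codes
    x_a = torch.stack([t2(img) for img, _ in support])
    x_b = torch.stack([t2(img) for img, _ in support])
    labels = torch.tensor([cls for _, cls in support])
    codes_a = class_encoder(x_a)                         # per-sample class codes, formula (11)
    codes_b = class_encoder(x_b)

    # --- contrastive branch: MLP head -> normalized 128-d features -> L_CCE
    z = F.normalize(mlp_head(torch.cat([codes_a, codes_b], 0)), dim=1)
    loss_cce = supervised_contrastive_loss(z, torch.cat([labels, labels], 0))

    # --- detection branch: average same-class codes (x_a only), convolve with query features
    m_q = feature_extractor(torch.stack([t1(img) for img, _ in query]))   # m_Q, formula (10)
    heatmaps = []
    for c in labels.unique():
        kernel = codes_a[labels == c].mean(0)            # averaged class code, simplified to
        # a single 1x1 heatmap kernel applied to m_Q, in the spirit of formula (12)
        heatmaps.append(F.conv2d(m_q, kernel.view(1, -1, 1, 1)))
    return torch.cat(heatmaps, dim=1), loss_cce
```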
and 4, performing meta-test on the trained small sample target detection model.
After the feature extractor, the class encoder and the target locator are subjected to meta-training, a small number of new class samples with labels are input to the class encoder to generate weight parameters of a specific class, and the target locator is parameterized. And the feature extractor performs feature extraction on the test picture to output a feature image, and then the target locator completes detection of the target in the test image. The new class of test pictures detect possible targets through forward propagation, and the model does not need to be trained or fine-tuned again.
A robust small sample detection model is obtained through the two-stage training, and one or more new classes injected into the model are tested in a manner similar to the second-stage training. First, a support set of new class samples is input to the model and the class-specific encoder extracts the class codes of the new classes; then a new test picture is input to the model and the feature extractor extracts features to obtain a multi-channel feature map; both are input to the target locator, a convolution operation in the manner of formula (3) yields a thermodynamic diagram (heatmap), the peak points within the threshold range are extracted from the heatmap, and the target candidate boxes are obtained according to formula (8), giving the detection result of the test picture.
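As a rough illustration of this meta-test procedure (assumed module names; only the class heatmap head is shown, and offset and size decoding would follow formula (8) as in the decoding sketch above):

```python
import torch
import torch.nn.functional as F

def detect_new_class(test_image, support_images, feature_extractor, class_encoder, t2):
    """Meta-test sketch: inject one labeled new class and detect it without retraining."""
    # class code of the new class from its small labeled support set
    codes = class_encoder(torch.stack([t2(img) for img in support_images]))
    kernel = codes.mean(0)                                   # averaged class code
    # feature map of the test picture, then class-specific convolution -> heatmap
    m = feature_extractor(test_image.unsqueeze(0))
    heatmap = torch.sigmoid(F.conv2d(m, kernel.view(1, -1, 1, 1)))
    # peak extraction and box regression then follow the decoding sketch given earlier
    return heatmap
```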
Under training with a large number of base class images containing rich labels and a small number of labeled few-shot new classes, the detection effect on new-class test pictures is improved, i.e., the mAP and AR scores are increased.
Here AP (average precision) is the average precision and mAP is the mean of the average precision over the m sample classes. Precision is computed over the predictions and indicates how many of the samples predicted to be positive are truly positive samples (i.e., among all samples predicted as positive, the proportion that are actually positive); it is used to evaluate prediction accuracy.
Recall (also known as TPR) is computed over the original samples and indicates how many of the positive examples in the samples are predicted correctly (i.e., the proportion of correctly predicted positives among all actual positive samples); it is used to evaluate how many of the positive samples are found.
As shown in fig. 4, the area under the PR curve is the AP value.
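As a small illustrative computation (not from the original text), the AP can be approximated as the area under the precision-recall curve:

```python
import numpy as np

def average_precision(precisions, recalls):
    """AP as the area under the precision-recall curve (illustrative, no interpolation)."""
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], np.asarray(recalls)[order]))
    p = np.concatenate(([1.0], np.asarray(precisions)[order]))
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))   # step-wise Riemann sum

# mAP is then the mean of the per-class AP values over the m detected classes
```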
Experiments are carried out on the COCO2017 [14] benchmark dataset commonly used for target detection, which contains 118,287 training images and 5,000 validation images covering 80 object categories in total, of which 20 categories are used as new classes. These 20 categories are the same as those covered by the PASCAL VOC2007 [15] dataset, and the remaining 60 categories of the COCO dataset serve as base classes. Experiments are therefore divided into same-dataset evaluation (60 COCO base classes, 20 COCO new classes) and cross-dataset evaluation, in which the PASCAL VOC dataset covering the 20 new classes is used as the test set.
COCO same-dataset evaluation:
The base class training images of the COCO dataset are used for the two stages of model meta-training. First, the base class training images are resized to 512×512 and supervised training is carried out according to the first-stage training procedure. Then, in the second stage, under the limitation of GPU memory, each episode randomly samples 28 meta-tasks; each meta-task covers the detection of 3 categories and each category contains 5 label boxes, and more tasks are beneficial to improving performance. For meta-testing, we randomly sample multiple support sets containing only the 20 new classes from the COCO training set for extracting class codes; each support set is injected into the model once and the model is updated only once, where each new class contains only {k=1, 5, 10} annotation boxes. The performance of our small sample target detection model is evaluated using the new class images of the COCO validation set as test images. We compare our model with several other popular small sample target detection methods: 1) a standard Fine-Tuning detection model based on Faster RCNN; 2) Model-Agnostic Meta-Learning (MAML); 3) non-incremental few-shot object detection via feature reweighting; 4) the incremental small sample object detection network ONCE. The experimental results are shown in table 1. From the results of injecting the trained small sample detection models with the new classes at {k=1, 5, 10}, our method has higher AP values than the other compared methods, and obtains the best AP and AR results both in the base class test and in the mixed test of base and new classes. Meanwhile, in the case of {k=10}, the comparison of the detection results of ONCE and our method is shown in fig. 5; it is easy to see that our method makes fewer misjudgments and detects difficult targets in the image more accurately.
Table 1 COCO same-dataset test results
VOC cross dataset evaluation:
In the two stages of meta-training, the base class data of the COCO dataset are used for training, in a manner completely consistent with the COCO same-dataset evaluation. The difference is that in the meta-test phase, the support sets of multiple new classes are sampled from the training set of the COCO2017 dataset; likewise, each new class contains only {k=5, 10} label boxes, each support set is injected into the model once, and the model is updated only once per injection. The PASCAL VOC2007 test set images are used as test images to evaluate the performance of our small sample target detection model. The experimental comparison results are shown in table 2. In addition, for the test with ten label boxes per new class in each support set, {k=10}, a comparison of the detection effects of ONCE and our method is given in fig. 6. From the results it can be seen that, when five and ten label boxes are sampled for each new class on the COCO training set and tested on the PASCAL VOC test set, the resulting AP and AR values are better than the detection results on the COCO dataset. This illustrates that, in the small sample target detection task, detecting images of the COCO dataset is more challenging than the PASCAL VOC dataset. At the same time, it can be concluded that our framework can easily be migrated to other datasets for detection, which is of great importance for related work on small sample target detection.
Table 2 PASCAL VOC cross-dataset detection comparison results
The advantage of the invention is that the idea of supervised contrastive learning is added to the small sample target detection model by attaching a contrastive learning branch, which improves the performance of the model's class-specific coding and, to a certain extent, alleviates the problem of insufficient generalization of the model to new classes. The training strategy of the invention does not construct a balanced small sample set of base classes and new classes for meta-training; instead, in the meta-test stage, a small number of new class samples never seen by the model are injected directly for detection. This is challenging work, because the newly injected detection categories are easily misjudged as base categories, which severely tests the generalization of the model; however, it is the focus of research in the field of small sample target detection, because with this method new class samples can be easily introduced for effective detection, which is more conducive to future deployment in concrete application scenarios. The experimental part evaluates the method with the main target detection evaluation metrics AP and AR, showing that good detection results are obtained and that the problem of insufficient generalization of existing detection models to new classes is alleviated to a certain extent. Meanwhile, contrastive learning benefits from larger contrast sample sizes and more GPU memory: with more meta-tasks sampled in each episode, the performance could be improved further.
The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the invention, and is not meant to limit the scope of the invention, but to limit the invention to the particular embodiments, and any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (5)

1. An anchor-free incremental target detection method, characterized by comprising the following steps:
step 1, selecting a target detection model;
step 1.1, selecting a CenterNet detection network as a target detection model of a base class network;
step 2, constructing a small sample target detection model based on the target detection model in the step 1;
step 2.1, a CenterNet detection network is regarded as being composed of a feature extractor and a target locator, wherein the feature extractor adopts a ResNet residual network as an encoder, a deconvolution network as a decoder, and all new classes and base classes share weights; the target locator contains convolution kernel weights of each individual class to be detected, analyzes the 3D feature map output by the feature extractor by using class-specific convolution kernels, and generates detection results of input samples in the form of heat-maps;
step 2.2, introducing a class-specific code generator having a class encoder with the same structure as the encoder of the feature extractor, for generating convolution kernel weights C_k and using the generated convolution kernel weights C_k to parameterize the target locator; performing contrastive learning branch training on the class codes generated by the class-specific code generator so as to improve the consistency among class codes of the same class and enlarge the difference between class codes of different classes;
step 3, performing meta-training on the small sample target detection model;
step 3.1, training a class feature extractor on a CenterNet detection network by utilizing a base class data set, wherein the class feature extractor is used for extracting features of new class data;
step 3.2, dividing the labeled base class data set into support set and query set images, inputting the query set into the feature extractor and the support set into the class-specific code generator, extracting features from the base class data set by the feature extractor, and generating class codes related to the base class data set by the class-specific code generator;
step 3.3, performing joint training on the class specific code generator and the target positioner so that the target positioner performs positioning learning of new class data by combining class codes and extracted features;
training the class-specific encoder by adopting an episode meta-learning training strategy:
dividing the whole training data into a plurality of episodes, wherein each episode contains a designated number of meta-tasks, the meta-tasks sample a class label set L from all classes, and according to L a training sample of a support set S and a query set Q is selected for each meta-task for training, wherein S and Q are obtained by randomly distributing the pictures of each class in the label set;
using random horizontal flipping, random cropping and color dithering as the data enhancement T_2 for the support set images, performing data enhancement twice on the image x in each support set to generate samples x_a and x_b as a basic positive sample pair, and inputting them into the class-specific encoder to extract features;
initializing the class-specific encoder with the encoder weights of the trained feature extractor;
in the forward propagation process, for each meta-task the query set images are augmented with T_1, the feature extractor trained in the first stage is used to extract the query-set features, obtaining a feature map of c channels through formula (10); meanwhile, the class-specific encoder processes the support set images augmented by T_2 to generate the corresponding weight parameters C_k, the specific operation being shown in formula (11):

m_Q = f(I),  I ∈ Q   (10)

C_k = g(I_k^S),  I_k^S ∈ S   (11)

wherein m_Q is the feature of query set image I obtained via the feature extractor, g(·) denotes the class-specific encoder, and I_k^S is a support set sample of class k;
outputting a 3 × 256-dimensional class encoding C_k by global average pooling; on the one hand, the class codes C_k^a and C_k^b correspondingly generated from all samples x_a and x_b in the support set S are input to the contrastive learning branch, encoded by the MLP-head into normalized contrast features, the similarity scores between codes of different classes are calculated, and the objective function is optimized through back-propagation; on the other hand, only the class codes derived from the x_a samples are taken, and the class codes of the same class are average-pooled to obtain C̄_k; the query-set image features m_Q of the same category and these class codes C̄_k are input to the target locator, where a convolution operation generates the heatmap Ŷ_Q and completes the target detection of the query set image, as shown in equation (12),

Ŷ_Q = m_Q ⊛ C̄_k   (12)
using an L1 loss, as shown in equation (13), the average absolute prediction error of Q is minimized by updating the parameters of the class code generator and the target locator, where n is the number of query set images and Z is the ground-truth heatmap,

L_Q = (1/n) Σ_{I∈Q} | Ŷ_Q − Z |   (13)

the total loss of the training phase consists of four parts, Heatmap, Offset, Size and the contrastive loss, as shown in equation (14):

L_meta-det = L_Q + λ_size · L_size + λ_off · L_off + L_CCE   (14).
2. the anchor-free incremental target detection method according to claim 1, wherein in step 4, the specific step of performing meta-test on the trained small sample target detection model is as follows:
step 4.1, inputting a small amount of new class data with labels to the trained class-specific encoder to generate weight parameters of specific classes, and parameterizing the target locator;
step 4.2, the class feature extractor performs feature extraction on the new class data and outputs a feature map,
and 4.3, finishing the detection of the target in the test image by the parameterized target positioner.
3. The method for detecting an incremental target based on no anchor according to claim 1, wherein in step 2.2, the specific step of performing the contrast learning branch training on the class code generated by the class-specific code generator is as follows:
step 2.2.1, converting the class code features into 128-dimensional contrast features by using a one-layer multi-layer perceptron;
step 2.2.2, measuring the similarity between codes of different classes on the normalized class features output by the multi-layer perceptron, and optimizing with the loss function of the supervised contrastive learning of class-specific codes to improve the intra-class similarity and inter-class variability of the class codes.
4. The anchor-free incremental target detection method according to claim 3, wherein in step 2.2.2 it is assumed that there are two class encodings X_i and X_j with the same label, and the loss function of the supervised contrastive learning of class-specific codes is:

L_CCE = (1/N) · Σ_{i=1}^{N} L_{z_i}   (1)

L_{z_i} = −1/(N_{y_i} − 1) · Σ_{j=1, j≠i, y_j=y_i}^{N} log [ exp(z_i · z_j / τ) / Σ_{k=1}^{N} 1_{k≠i} · exp(z_i · z_k / τ) ]   (2)

in formula (1), L_{z_i} is the loss function value calculated for a single sample, L_CCE is the average loss function value of a meta-task, N is the number of samples, y_i is the label of class encoding X_i and y_j is the label of class encoding X_j; in formula (2), N_{y_i} represents the number of samples belonging to the same class in a meta-task, 1_{k≠i} is the indicator function, which takes 0 if and only if k = i and 1 otherwise, τ is the temperature parameter to be tuned, and z_i is the 128-dimensional normalized class feature obtained by processing the class encoding X_i with the multi-layer perceptron; the numerator in the log function is the representation distance between X_i and X_j, and the denominator is the representation distance between X_i and all data, including positive and negative samples, in each meta-task.
5. The anchor-free incremental target detection method according to claim 1, characterized in that in step 2.2, the class-specific code generator outputs the class-related convolution kernel weight C_k by means of global average pooling to further parameterize the weight parameters of the target locator.
CN202111153974.XA 2021-09-29 2021-09-29 Anchor-free incremental target detection method Active CN113822368B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111153974.XA CN113822368B (en) 2021-09-29 2021-09-29 Anchor-free incremental target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111153974.XA CN113822368B (en) 2021-09-29 2021-09-29 Anchor-free incremental target detection method

Publications (2)

Publication Number Publication Date
CN113822368A CN113822368A (en) 2021-12-21
CN113822368B true CN113822368B (en) 2023-06-20

Family

ID=78921753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111153974.XA Active CN113822368B (en) 2021-09-29 2021-09-29 Anchor-free incremental target detection method

Country Status (1)

Country Link
CN (1) CN113822368B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114663707A (en) * 2022-03-28 2022-06-24 中国科学院光电技术研究所 Improved few-sample target detection method based on fast RCNN
CN115880266B (en) * 2022-12-27 2023-08-01 深圳市大数据研究院 Intestinal polyp detection system and method based on deep learning
CN116363085B (en) * 2023-03-21 2024-01-12 江苏共知自动化科技有限公司 Industrial part target detection method based on small sample learning and virtual synthesized data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329827A (en) * 2020-10-26 2021-02-05 同济大学 Increment small sample target detection method based on meta-learning
CN112861720A (en) * 2021-02-08 2021-05-28 西北工业大学 Remote sensing image small sample target detection method based on prototype convolutional neural network

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7167123B2 (en) * 1999-05-25 2007-01-23 Safe Zone Systems, Inc. Object detection method and apparatus
CN110969205A (en) * 2019-11-29 2020-04-07 南京恩博科技有限公司 Forest smoke and fire detection method based on target detection, storage medium and equipment
WO2021154624A1 (en) * 2020-01-27 2021-08-05 Matthew Charles King System and method for performing machine vision recognition of dynamic objects
CN112819110B (en) * 2021-04-19 2021-06-29 中国科学院自动化研究所 Incremental small sample target detection method and system based on weight generation
CN113379718B (en) * 2021-06-28 2024-02-02 北京百度网讯科技有限公司 Target detection method, target detection device, electronic equipment and readable storage medium
CN113361645B (en) * 2021-07-03 2024-01-23 上海理想信息产业(集团)有限公司 Target detection model construction method and system based on meta learning and knowledge memory
CN113392855A (en) * 2021-07-12 2021-09-14 昆明理工大学 Small sample target detection method based on attention and comparative learning
CN113393457B (en) * 2021-07-14 2023-02-28 长沙理工大学 Anchor-frame-free target detection method combining residual error dense block and position attention

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329827A (en) * 2020-10-26 2021-02-05 同济大学 Increment small sample target detection method based on meta-learning
CN112861720A (en) * 2021-02-08 2021-05-28 西北工业大学 Remote sensing image small sample target detection method based on prototype convolutional neural network

Also Published As

Publication number Publication date
CN113822368A (en) 2021-12-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant