CN113822368A

CN113822368A - Anchor-free incremental target detection method

Info

Publication number: CN113822368A
Application number: CN202111153974.XA
Authority: CN
Inventors: 符颖; 林弟忠; 胡金蓉; 文武; 邹书蓉; 周激流
Original assignee: Chengdu University of Information Technology
Current assignee: Chengdu University of Information Technology
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2021-12-21
Anticipated expiration: 2041-09-29
Also published as: CN113822368B

Abstract

The invention relates to the field of image recognition, and in particular discloses an anchor-free incremental target detection method, which comprises the following steps: step 1, selecting a target detection model; step 2, constructing a small sample target detection model based on the target detection model of step 1; step 3, performing meta-training on the small sample target detection model; and step 4, performing meta-testing on the trained small sample target detection model. By training on a large amount of richly labeled base class data (images) and a small number of labeled few-shot samples of new classes, the invention improves detection on new-class test pictures, that is, it raises the mAP and AR scores.

Description

Anchor-free incremental target detection method

Technical Field

The invention relates to the field of image recognition, in particular to an anchor-free incremental target detection method.

Background

With high-performance parallel computing maturing and neural networks developing rapidly, target detection based on deep learning has quickly replaced methods based on hand-crafted feature extraction. The main task of target detection is to locate the objects of interest in an input image and then accurately determine the category of each one. Mature target detection algorithms have been deployed successfully in practical scenarios such as video surveillance, autonomous driving and traffic scene detection, but they depend on large-scale labeled data. Where labeled data is scarce, the range of practical applications is narrow and the detection tasks that can be carried out are limited. Data labeling is costly and labor-intensive, and obtaining large-scale labeled data is impractical in most real application scenarios, which greatly limits the application of existing target detection algorithms to more practical scenes.

For this reason, learning a target detection model with some generalization capability from only a few labeled data is an urgent research problem. In recent years, many researchers have focused on target detection with a small amount of labeled data, namely small sample (few-shot) target detection. Most current models build on target detection frameworks such as Faster R-CNN, YOLO and SSD and borrow the meta-training strategy of few-shot learning: after training on a large amount of base class data, a small number of labeled samples of unseen new classes are injected into the detection model, which must then complete both the classification task and the regression task for the new classes. This makes the problem very challenging.

In 2019, Fan et al. introduced an attention-RPN module into the region proposal network to fuse the features of the query image and the support set images, proposed a multi-relation detector that learns feature relations in three aspects (local, global and cross-correlation), and adopted a two-way contrastive training strategy to perform similarity matching when detecting new classes. In 2020, Wang et al. [6] used Faster R-CNN as the framework and trained in two stages; only the classification and regression sub-networks are fine-tuned in the second stage, and the combination weights of the features are readjusted to adapt to novel classes. Also in 2020, Juan-Manuel et al. [7] built on the CenterNet framework and introduced a feature extractor for image feature extraction, a target locator for target localization, and a ResNet-50 network that extracts the weights corresponding to the images of each class and uses those weights to detect new classes. In 2021, Bo Sun et al. [9] introduced the feature pyramid and contrastive learning into the model proposed by Wang et al., improving detection performance by 2.7% on the COCO benchmark data set and by 8.8% on the standard PASCAL VOC data set. Such a detection method detects directly in a conventional inference form, can easily introduce new classes, is very efficient, has low requirements on new-class data, and has a clear performance advantage over existing methods. However, the ONCE network that Juan-Manuel et al. built on the CenterNet target detection framework uses only a ResNet-50 network to extract class-specific codes; judging from the experimental results, this is not an optimal scheme, and its ability to extract new-class features is insufficient.

Disclosure of Invention

To solve the above problems, the invention provides an anchor-free incremental target detection method that extracts class-specific feature information better and improves the detection result.

The invention is realized by the following technical scheme:

An anchor-free incremental target detection method comprises the following steps:

step 1, selecting a target detection model;

step 2, constructing a small sample target detection model based on the target detection model in the step 1;

step 3, performing meta-training on the small sample target detection model;

step 4, performing meta-testing on the trained small sample target detection model.

As an optimization, in step 1, the specific steps of constructing the target detection model include:

step 1.1, selecting a CenterNet detection network as a target detection model of a base class network.

As an optimization, in step 2, the specific steps of constructing a small sample target detection model are as follows:

step 2.1, regarding the CenterNet detection network as consisting of a feature extractor and a target locator, wherein the feature extractor uses a ResNet residual network as the encoder and a deconvolution network as the decoder, and its weights are shared by all new classes and base classes; the target locator contains the convolution kernel weights of each individual class to be detected, and uses the class-specific convolution kernels to analyze the 3D feature map output by the feature extractor, generating the detection result of the input sample in the form of heatmaps;

step 2.2, introducing a class-specific code generator, wherein the class-specific code generator has a class encoder with the same structure as the encoder of the feature extractor, is used for generating the convolution kernel weight C_k, and uses the generated convolution kernel weight C_k to parameterize the target locator; the class codes generated by the class-specific code generator are trained through a contrastive learning branch so as to improve the consistency among codes of the same class and enlarge the difference between codes of different classes.

As an optimization, in step 3, the meta-training of the small sample target detection model specifically includes:

step 3.1, training a class feature extractor on the CenterNet detection network by using the base class data set, wherein the class feature extractor is used for feature extraction of new class data;

step 3.2, dividing the labeled base class data set into support set images and query set images, inputting the query set into the feature extractor and the support set into the class-specific code generator, the feature extractor extracting features from the base class data set and the class-specific code generator generating class codes related to the base class data set;

step 3.3, jointly training the class-specific code generator and the target locator, so that the target locator learns to localize new class data by combining the class codes and the extracted features.

As an optimization, in step 4, the specific steps of performing meta-testing on the trained small sample target detection model are as follows:

step 4.1, inputting a small amount of labeled new class data into the trained class-specific encoder to generate the weight parameters of the specific classes, and parameterizing the target locator;

step 4.2, extracting the features of the new class data with the class feature extractor and outputting a feature map;

step 4.3, completing the detection of the targets in the test image with the parameterized target locator.

As an optimization, in step 2.2, the specific steps of training the class codes generated by the class-specific code generator through the contrastive learning branch are as follows:

step 2.2.1, applying a one-layer multilayer perceptron to convert the class code features into 128-dimensional contrastive features;

step 2.2.2, measuring the similarity between different class codes on the normalized class features encoded by the multilayer perceptron, and optimizing with the loss function for supervised contrastive learning of class-specific codes, so as to improve intra-class similarity and inter-class difference.

As an optimization, in step 2.2.2, assuming that there are two class codes X_i and X_j with the same label, the loss function for supervised contrastive learning of class-specific codes is given by formulas (1) and (2).

In formula (1), the per-sample term is the loss function value calculated for a single sample, and L_CCE is the average loss function value of one meta-task.

In formula (2), the normalizing count is the number of samples belonging to the same class in a meta-task; 𝟙_{k≠i} is the indicator function, which takes 0 if and only if k = i and 1 otherwise; τ is the temperature parameter being optimized; and the feature term is the 128-dimensional normalized class feature obtained by passing class code X_i through the multilayer perceptron. The numerator inside the log function is the representation distance between X_i and X_j, and the denominator is the representation distance between X_i and all data (positive and negative samples) in each meta-task.
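The formula images for (1) and (2) are not present in this text. Based on the symbol descriptions above and on the standard supervised contrastive loss, one plausible form is sketched below. The symbol names L_{z_i} (per-sample loss), N_{y_i} (number of samples sharing the label of sample i) and \tilde{z}_i (normalized 128-dimensional feature of X_i) are assumptions, and N is taken to be the number of support samples in a meta-task.

```latex
L_{CCE} = \frac{1}{N} \sum_{i=1}^{N} L_{z_i} \tag{1}

L_{z_i} = \frac{-1}{N_{y_i} - 1} \sum_{j=1}^{N}
          \mathbb{1}_{j \neq i}\, \mathbb{1}_{y_i = y_j}
          \log \frac{\exp\left(\tilde{z}_i \cdot \tilde{z}_j / \tau\right)}
                    {\sum_{k=1}^{N} \mathbb{1}_{k \neq i}
                     \exp\left(\tilde{z}_i \cdot \tilde{z}_k / \tau\right)} \tag{2}
```

Under this reading, the inner product \tilde{z}_i \cdot \tilde{z}_j plays the role of the "representation distance" named in the text, and the denominator sums over every other sample in the meta-task, positive and negative alike.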

As an optimization, in step 2.2, the class-specific code generator outputs the class-dependent convolution kernel weights C_k by global average pooling, and these weights then parameterize the target locator.

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. By training on a large amount of richly labeled base class data (images) and a small number of labeled few-shot samples of new classes, the invention improves detection on new-class test pictures, that is, it raises the mAP and AR scores.

2. The method of the invention produces fewer misjudgments and detects difficult targets in the image more accurately.

3. The method can be migrated easily to other data sets for detection, and is of important significance for related work on small sample target detection.

Drawings

In order to more clearly illustrate the technical solutions of the exemplary embodiments of the present invention, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and that for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort. In the drawings:

FIG. 1 is a schematic diagram of the CenterNet network structure in the anchor-free incremental target detection method according to the present invention;

FIG. 2 is a schematic diagram of the network structure of the small sample target detection model in the anchor-free incremental target detection method according to the present invention;

FIG. 3 is a schematic diagram of the class code contrastive learning branch in the anchor-free incremental target detection method according to the present invention;

FIG. 4 is a schematic diagram of the Precision and Recall curves;

FIGS. 5 and 6 are graphs comparing the results of ONCE (upper) and the anchor-free incremental target detection method according to the present invention (lower).

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.

Research shows that when base class data (pictures) is not mixed into training, the small amount of new class data (pictures) injected into a small sample target detection model suffers heavy interference: new class data is easily misjudged as one of the trained base classes, which seriously degrades the detection result. Nevertheless, injecting never-before-seen new class samples into a model trained on a large amount of base class data is one of the important tasks in the field of small sample target detection, because it makes it easier to add new classes for detection using only a small amount of labeled data.

This research divides the data set into base classes and new classes with no category overlap at all. Base class samples with rich label information serve as feature guidance, and a small number of labeled new classes are used to test the performance of the model. Once the constructed model has been trained on the base classes, it can be transferred effectively to unseen new class samples for detection, without constructing a small sample set of base and new classes for retraining. Such a detection method is very challenging, but it allows new sample classes to be injected easily for detection. Small sample sets of the new class can be found in "Xin Wang, Thomas E.Huang, Trevor Darrell, Joseph E Gonzalez, and Fisher Yu.

In the anchor-free incremental target detection method, "anchor-free" refers to the division of target detection frameworks into anchor-free and anchor-based ones: anchor boxes are predefined by the network and involve complex computation, whereas an anchor-free framework generates the detection boxes directly in the later stage of the network. "Incremental" means that new data can be introduced easily and added directly. The method, namely a small sample target detection method based on contrastive learning of class-specific codes, comprises the following steps:

Step 1, selecting a target detection model. To construct an effective detection network that can adapt to detecting new classes of samples, a suitable base class detection network must be selected first, and the choice is not arbitrary, because detection on a small number of new-class samples has to be efficient and fast. In recent years, Faster R-CNN has commonly been used as the base class detection network, but it adopts a two-stage design and softmax-based classification, and the interaction between classes makes the detection of independent new classes inflexible. The target detection model therefore selects CenterNet, an anchor-free target detection algorithm capable of fast and efficient detection, as the base class network. Compared with networks such as YOLO and SSD, the CenterNet detection network balances detection speed and precision and makes it easier to construct class-specific representation extraction modules.

The design of the CenterNet detection network comes from the idea of keypoint detection: the center point is found by keypoint estimation, target attributes such as the height and width of the bounding box are obtained by regression, and no post-processing such as keypoint grouping or non-maximum suppression is needed. The network structure is shown in FIG. 1. CenterNet feeds the training images into a fully convolutional encoder-decoder network to obtain multiple heatmaps, one per category. The peak points of the 2D heatmaps are the center points, and the height and width of the target are also predicted at the position of each peak point. The details of the detection are described later. This keypoint-based target detection framework not only eliminates region proposals but also generates a predicted heatmap unique to each class, so that each class is detected independently through an activation threshold. The framework is therefore well suited to injecting new classes in small sample target detection, since the base classes and the new classes do not interfere with each other.

Step 2, constructing a small sample target detection model based on the target detection model of step 1.

So that the detection model can output the corresponding weight parameters for each class of image and extract the effective features of a new class independently, the network structure of the target detection model, CenterNet, is adjusted. Instead of obtaining the weight parameters through the original end-to-end training, a meta-learning training strategy is introduced and a new small sample target detection model is designed. The network structure is shown in FIG. 2.

First, the CenterNet network is regarded as consisting of a feature extractor and a target locator. The feature extraction network uses the ResNet residual network as the encoder and a deconvolution network as the decoder, and its weights are shared by all new classes and base classes. The target locator contains the specific weight parameters of each individual class to be detected; it analyzes the 3D feature map output by the feature extractor with class-specific convolution kernels and generates the detection results of the input samples in the form of heatmaps.

Secondly, a class-specific code generator is introduced to generate the weights of the convolution kernels in the target locator, replacing the original way of updating the weights through iterative training. The class-specific code generator uses the same encoder structure as the feature extraction network but is not connected to a deconvolution network. It outputs the class-related weight parameter (convolution kernel weight) C_k by global average pooling, and C_k is then used to synthesize the parameters of the target locator network (that is, the class-dependent convolution kernel weights). In addition, a contrastive learning branch is attached after the class-specific encoder. It guides the class-specific code generator to learn contrast-aware class-specific codes by computing the feature similarity between class codes, so that intra-class similarity and inter-class difference are modeled better.
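The way a pooled class code can parameterize the target locator can be illustrated with a minimal sketch. It assumes the support features have already passed through the class encoder, treats C_k as a 1×1 convolution kernel (a per-position dot product), and applies a sigmoid to the response; the function names, the toy tensors and the sigmoid are assumptions made for illustration, not details confirmed by the text.

```python
import math


def global_average_pool(feature_map):
    """Average an H x W x C feature map (nested lists) over all spatial positions."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    return [
        sum(feature_map[y][x][ch] for y in range(height) for x in range(width))
        / (height * width)
        for ch in range(channels)
    ]


def locate(query_features, class_code):
    """Use the class code as a 1x1 convolution kernel over the query feature map.

    The sigmoid that squashes responses into (0, 1) is an assumption of this sketch.
    """
    return [
        [1.0 / (1.0 + math.exp(-sum(f * c for f, c in zip(pixel, class_code))))
         for pixel in row]
        for row in query_features
    ]


# Toy encoder output for the support images of one class k (2 x 2 positions, 2 channels).
support_features = [[[1.0, 0.0], [3.0, 0.0]],
                    [[0.0, 2.0], [0.0, 2.0]]]
class_code = global_average_pool(support_features)  # plays the role of C_k

# Toy query feature map (1 x 2 positions, 2 channels).
query_features = [[[2.0, 2.0], [-2.0, -2.0]]]
heatmap = locate(query_features, class_code)
```

Because the kernel is produced by a forward pass over the support images, registering a new class in this sketch requires no gradient update of the locator.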

Specifically, the steps of training the class codes generated by the class-specific code generator through the contrastive learning branch are as follows:

Step 2.2.1, applying a one-layer multilayer perceptron (MLP head) to convert the class code features into 128-dimensional contrastive features;

Step 2.2.2, formulating the loss function for supervised contrastive learning of class-specific codes, inspired by related work on self-supervised and supervised contrastive learning.

In the supervised setting, the features of class codes of the same class all form positive sample pairs, and the features of class codes of different classes form negative sample pairs. To pull class codes X_i and X_j with the same label closer together and push class codes with different labels farther apart, a loss function for supervised contrastive learning of class-specific codes is designed, as shown in formulas (1) and (2). The similarity between class code features is measured by the inner product. Multiple meta-tasks are sampled during training, and in each meta-task the support set part samples N samples from the given data, denoted {x_k, y_k}, k = 1, 2, ..., N, where y_k is the label of x_k.

In formula (1), the per-sample term is the loss calculated for a single sample, and L_CCE is the average loss of one meta-task.

In formula (2), the normalizing count is the number of samples belonging to the same class in a meta-task; 𝟙_{k≠i} is the indicator function, which takes 0 if and only if k = i and 1 otherwise; τ is the temperature parameter being optimized [12]; and the feature term is the 128-dimensional normalized class feature obtained by passing class code X_i through the multilayer perceptron (MLP head). The numerator inside the log function is the representation distance between X_i and X_j, and the denominator is the representation distance between X_i and all data (positive and negative samples) in each meta-task. This design captures intra-class similarity and inter-class difference well and improves the coding performance of the class-specific encoder, which is confirmed in later experiments.
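A minimal sketch of a supervised contrastive loss over class-code features, written to match the description above: features are L2-normalized, similarity is the inner product, positives are the other samples with the same label, and the denominator runs over every other sample in the meta-task. The function names, the fixed temperature `tau=0.2` (the text treats τ as a tuned parameter) and the choice to give samples without a positive partner zero loss are assumptions.

```python
import math


def normalize(vec):
    """Scale a feature vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def class_code_contrastive_loss(features, labels, tau=0.2):
    """Average per-sample supervised contrastive loss over one meta-task."""
    z = [normalize(f) for f in features]
    n = len(z)
    total = 0.0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # assumption: a sample without a positive pair contributes 0
        # Denominator: similarity of sample i to every other sample (k != i).
        denominator = sum(math.exp(dot(z[i], z[k]) / tau) for k in range(n) if k != i)
        total -= sum(
            math.log(math.exp(dot(z[i], z[j]) / tau) / denominator) for j in positives
        ) / len(positives)
    return total / n


# Toy 2-D "contrast features" standing in for the 128-dimensional MLP output.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
matched = class_code_contrastive_loss(features, [0, 0, 1, 1])
mixed = class_code_contrastive_loss(features, [0, 1, 0, 1])
```

In this toy case the loss is lower when the labels agree with the feature clusters (`matched`) than when they cut across them (`mixed`), which is the behavior the contrastive branch is meant to reward.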

Step 3, performing meta-training on the small sample target detection model.

Using the training strategy of meta-learning, and in order to make full use of the base categories, the meta-training is divided into two serial stages. In the first stage, a class feature extractor is trained on the CenterNet detection network structure with a large amount of base class data and is later used for feature extraction of a small amount of new class data. The second stage of training is divided into a number of episodes (the notion of an episode comes from Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NeurIPS, 2016; T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. ICCV, 2017.). The labeled training data is divided into query set images and support set images, which are input into the feature extractor and the class-specific encoder of the small sample target detection model respectively, and the class-specific encoder and the target locator are trained jointly, with the target locator learning to localize small sample targets by combining the class codes and the extracted features.

In this training mode, each episode samples several meta-tasks, and the meta-tasks contain different category combinations. This mechanism lets the model learn what is common across meta-tasks, such as how to extract important features and compare the similarity of samples, while forgetting the task-specific parts of individual meta-tasks. A model learned this way classifies better when it faces a new, unseen meta-task, and it is also better at learning class codes.
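The sampling of one meta-task can be sketched as follows: draw a label set, then split the images of each chosen class at random into support and query parts. The function name, the parameters `n_way`, `k_shot` and `n_query`, and the toy file names are hypothetical; the text does not specify these quantities.

```python
import random


def sample_meta_task(images_by_class, n_way, k_shot, n_query, rng):
    """Sample a label set, then disjoint support and query images per class."""
    label_set = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label in label_set:
        picks = rng.sample(images_by_class[label], k_shot + n_query)
        support += [(image, label) for image in picks[:k_shot]]
        query += [(image, label) for image in picks[k_shot:]]
    return label_set, support, query


# Toy base-class data set: class name -> list of image identifiers.
dataset = {
    name: [f"{name}_{i}.jpg" for i in range(6)]
    for name in ("banana", "umbrella", "kite")
}
label_set, support, query = sample_meta_task(
    dataset, n_way=2, k_shot=2, n_query=3, rng=random.Random(0)
)
```

An episode would repeat this call for its specified number of meta-tasks, so that successive tasks see different category combinations.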

Training in the first stage: learning of feature extractors

On the base class data set, random flipping, random scaling, cropping and color jittering are used as the data augmentation T₁, and standard supervised training is performed in a manner similar to CenterNet. Given a training image I ∈ R^{h×w×3} of height h and width w, a feature extractor f(·) based on the ResNet residual network extracts a feature map m = f(I) of spatial size h/r × w/r, where r is the down-sampling factor; m contains c channels, corresponding to the c classes of target objects. Then, in the target locator h, the base class convolution kernels c_k ∈ R^{1×1×c} are learned and convolved with the feature map m to obtain the heatmap of each class. A heatmap value of 1 at coordinates (x, y) represents a detected keypoint, that is, a target object of class c is detected at (x, y), while a value of 0 means that there is no object of class c at the current coordinate point and the point is regarded as background.

As shown in formula (3), the number of output channels is c + 4: the remaining four channels correspond to the prediction of the center-point offset of the bounding box and the prediction of the width and height of the bounding box. The operator in formula (3) denotes a convolution operation, and K_b is the number of base class samples.

During training, each ground-truth keypoint is first mapped to its coordinates after down-sampling. On the c-channel feature map obtained by down-sampling, every ground-truth keypoint is then spread onto the heatmap through the Gaussian kernel Y_xyc. Y_xyc is calculated as shown in formula (4), where σ_p is an adaptive standard deviation related to the height and width of the target. If the Gaussian distributions generated by different objects of the same class overlap, the larger value is kept at each point in the overlapping range.
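The construction of a ground-truth heatmap for one class can be sketched as below, assuming the usual CenterNet form of the kernel, exp(-((x - p_x)² + (y - p_y)²) / (2σ_p²)), with down-sampled coordinates obtained by integer division by r. The function names and the fixed `sigma=1.0` are assumptions; in the method σ_p adapts to the target size.

```python
import math


def downsample_keypoint(px, py, r):
    """Map an image-space keypoint to heatmap coordinates (floor division by r)."""
    return px // r, py // r


def splat_gaussian(heatmap, cx, cy, sigma):
    """Write a Gaussian bump centered at (cx, cy), keeping the larger value on overlap."""
    for y, row in enumerate(heatmap):
        for x in range(len(row)):
            value = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            row[x] = max(row[x], value)
    return heatmap


# 5 x 5 heatmap for one class, down-sampling factor r = 4, two objects of that class.
heatmap = [[0.0] * 5 for _ in range(5)]
splat_gaussian(heatmap, *downsample_keypoint(4, 4, 4), sigma=1.0)    # center (1, 1)
splat_gaussian(heatmap, *downsample_keypoint(12, 12, 4), sigma=1.0)  # center (3, 3)
```

Each object center ends up with the value 1, and points covered by both Gaussians keep the maximum of the two, matching the overlap rule in the text.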

The keypoint heatmap is trained with a loss function adapted from focal loss [20], as shown in formula (5).

Here α and β are the hyper-parameters of the focal loss, set to α = 2 and β = 4 in the experiments, and N is the number of keypoints in the image I; see [13] for more details.
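Formula (5) is not reproduced in this text. The sketch below assumes the penalty-reduced focal loss used by CenterNet: positions whose ground-truth value is 1 are penalized by (1 - Ŷ)^α·log(Ŷ), all other positions by (1 - Y)^β·Ŷ^α·log(1 - Ŷ), and the sum is divided by the number of keypoints N. The function name and the small `eps` added for numerical safety are assumptions.

```python
import math


def keypoint_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-12):
    """Penalty-reduced focal loss over flattened heatmaps (lists of floats)."""
    loss, num_keypoints = 0.0, 0
    for p, t in zip(pred, target):
        if t == 1.0:
            # Ground-truth center: down-weight predictions that are already confident.
            loss -= (1.0 - p) ** alpha * math.log(p + eps)
            num_keypoints += 1
        else:
            # Elsewhere: the (1 - t)^beta factor softens the penalty near a center.
            loss -= (1.0 - t) ** beta * p ** alpha * math.log(1.0 - p + eps)
    return loss / max(num_keypoints, 1)


target = [1.0, 0.0, 0.5]
good = keypoint_focal_loss([0.9, 0.1, 0.5], target)
bad = keypoint_focal_loss([0.1, 0.9, 0.5], target)
```

With α = 2 and β = 4 as in the experiments, `good` is smaller than `bad` because the first prediction is close to the ground truth at both the center and the background position.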

Mapping the down-sampled feature map back to the original image introduces a precision error, so an offset needs to be predicted for each center point. The center points of all classes c share the same offset prediction, which is trained with the L₁ loss shown in formula (6). In formula (6), the predicted offset is compared with the offset value calculated in advance during the training process.

Suppose target k belongs to class c_k and has a known bounding box and center point. The size of each target k is regressed to the value S_k, which is also calculated in advance and is the width and height of the ground truth after down-sampling. The keypoint prediction network predicts all center points, and a size prediction gives the predicted width and height for each point in the heatmap; it is also trained with the L₁ regression loss, as shown in formula (7).

At prediction time, the local peak points of each category c are extracted from the heatmaps. A peak is a point whose activation value is greater than or equal to the values of its 8 neighbors; the peaks are found with 3×3 max pooling and the top 100 are kept. The peak values are the probabilities of the class-c target center points. According to the set threshold, the center points whose values exceed the threshold are selected from the 100 peak points as the final result, and the peak value is used as the confidence of the current point when regressing a target box. The predicted target box is shown in formula (8), which combines the peak position with the predicted center-point offset and the predicted bounding-box size. The offset and the bounding-box size are both trained with the L₁ regression loss, as shown in formulas (6) and (7).
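The prediction step can be sketched as a 3×3 local-maximum test followed by top-k selection, thresholding and box decoding. The box formula assumed here is the standard CenterNet one: the center is the peak position plus the predicted offset, and the corners are the center minus and plus half the predicted width and height. Function names and the toy values are assumptions.

```python
def find_peaks(heatmap, threshold, top_k=100):
    """Return (score, x, y) for local maxima of a 2-D heatmap above a threshold."""
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            # 3x3 window around (x, y), clipped at the borders (max-pooling equivalent).
            window = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            if heatmap[y][x] >= max(window):
                peaks.append((heatmap[y][x], x, y))
    peaks.sort(reverse=True)  # highest score first
    return [peak for peak in peaks[:top_k] if peak[0] > threshold]


def decode_box(x, y, offset, size):
    """Turn a peak position, center offset and (width, height) into box corners."""
    center_x, center_y = x + offset[0], y + offset[1]
    box_w, box_h = size
    return (center_x - box_w / 2, center_y - box_h / 2,
            center_x + box_w / 2, center_y + box_h / 2)


heatmap = [[0.0] * 4 for _ in range(4)]
heatmap[1][1] = 0.9  # strong center
heatmap[1][2] = 0.6  # neighbor of the strong center, suppressed by the 3x3 test
heatmap[3][3] = 0.8  # second center
peaks = find_peaks(heatmap, threshold=0.3)
box = decode_box(1, 1, offset=(0.5, 0.25), size=(2.0, 1.0))
```

The local-maximum test is what lets the framework skip non-maximum suppression: the 0.6 response next to the 0.9 peak is discarded without comparing boxes.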

The total loss of this training stage consists of three parts, heatmap, offset and size, and is used to optimize the parameters of the feature extractor f and the parameters c of the locator. As shown in formula (9), λ_size = 0.1 and λ_off = 1 were set in the experiments.

L_det = L_k + λ_size·L_size + λ_off·L_off (9)
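A small sketch of the weighted sum in formula (9), together with an L₁ term of the kind used for the offset and size regressions. The averaging over the number of keypoints and the function names are assumptions, since formulas (6) and (7) are not reproduced in this text.

```python
def l1_loss(predicted, target):
    """Sum of absolute errors over all keypoints, divided by the number of keypoints."""
    total = sum(
        abs(p - t)
        for pred_pair, target_pair in zip(predicted, target)
        for p, t in zip(pred_pair, target_pair)
    )
    return total / len(target)


def detection_loss(l_k, l_size, l_off, lambda_size=0.1, lambda_off=1.0):
    """Formula (9): heatmap loss plus weighted size and offset losses."""
    return l_k + lambda_size * l_size + lambda_off * l_off


# One keypoint: predicted vs. precomputed (width, height) and (dx, dy).
size_loss = l1_loss([(10.0, 6.0)], [(12.0, 5.0)])   # |10-12| + |6-5| = 3.0
offset_loss = l1_loss([(0.5, 0.5)], [(0.25, 0.0)])  # 0.25 + 0.5 = 0.75
total = detection_loss(l_k=1.0, l_size=size_loss, l_off=offset_loss)
```

The small weight on the size term keeps the box-size regression, whose raw values are in pixels, from dominating the heatmap and offset terms.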

The goal of this stage is simply to learn a feature extractor. The target locator here is a conventional CenterNet locator and is discarded in the second stage; only the trained feature extractor f(·) is used.

Training in the second stage: learning of the class-specific encoder

The class-related convolution kernel parameters obtained in the previous stage are fixed parameters of the base classes only. In the second stage, the parameters of the feature extractor are frozen and the class-specific encoder with the attached contrastive learning branch is the main object of training, so that it can efficiently synthesize the parameters of a new class from a small number of labeled samples. To train the class-specific encoder efficiently, an episodic meta-learning training strategy is adopted [13].

The specific method is as follows. The whole training data is divided into a number of episodes, and each episode contains a specified number of meta-tasks. Each meta-task samples a class label set L (for example, L = {banana, umbrella, …}) from all classes. According to L, the training samples of a support set S and a query set Q are selected for each meta-task, where S and Q are obtained by randomly assigning the pictures of each class in the label set. Random horizontal flipping, random cropping and color jittering are used as the data augmentation T₂ for support set images. Each support set image x is augmented twice to generate samples x_a and x_b, which form a basic positive sample pair and are input into the class-specific encoder for feature extraction. The class-specific encoder is initialized with the encoder weights of the feature extractor trained in the first stage.

In the forward pass of each meta-task, T₁ is applied to the query set images as data augmentation, and the feature extractor trained in the first stage extracts the query set features, giving a feature map of c channels according to formula (10). At the same time, the class-specific encoder generates the corresponding weight parameters from the support set images processed by T₂; the specific operation is shown in formula (11).

m_Q = f(I), I ∈ Q (10)

Here m_Q is the feature obtained by the feature extractor for the query set image I, and the input of formula (11) is the support set samples of class k. Finally, a 3×256-dimensional class code is output through global average pooling.

On the one hand, the class codes generated from all samples in the support set S are input into the contrastive learning branch, where the MLP head encodes them into normalized contrastive features; the similarity scores between different class codes are then calculated, and the objective function is optimized by back-propagation. On the other hand, to facilitate comparison with other detection methods, only the class codes obtained from the x_a samples are taken, and the class codes of the same class are average-pooled into a single code per class. The query set image features m_Q and these class codes of the corresponding categories are input into the target locator for a convolution operation, generating a heatmap and completing target detection on the query set image, as shown in formula (12).

Similar to the first-stage CenterNet training, an L₁ loss is used, as shown in formula (13). The mean absolute prediction error on Q is minimized by updating the parameters of the class code generator and the target locator, where n is the number of images in the query set and Z is the ground-truth heatmap.

The total loss of this training stage consists of four parts, heatmap, offset, size and contrastive loss, as shown in formula (14):

L_meta-det = L_Q + λ_size·L_size + λ_off·L_off + L_CCE (14)

After the feature extractor, the class encoder and the target locator are subjected to meta-training, a small number of new class samples with labels are input into the class encoder to generate weight parameters of a specific class, and the target locator is parameterized. And the feature extractor extracts features of the test picture to output a feature map, and then the target locator finishes the detection of the target in the test picture. The new class of test pictures are propagated forward to detect possible targets, and the model does not need to be trained or fine-tuned again.

A robust small sample detection model is obtained through the two-stage training, and one or more new classes are injected into the model for testing in a manner similar to the second training stage. First, a support set of new-class samples is input into the model, and the class-specific encoder extracts the class codes of the new classes. A new-class test picture is then input into the model, and the feature extractor extracts its features to obtain a multi-channel feature map. Both are input into the target locator, which performs the convolution operation of formula (3) to obtain a heat map. Peak points within the threshold range are extracted from the heat map, and the target candidate boxes are obtained according to formula (8), giving the detection result for the test picture.
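The peak extraction and box decoding at the end of this procedure can be sketched as follows. Formula (8) is not reproduced in this text, so the decoding below follows the usual CenterNet convention (center plus offset, half the size on each side) as an assumption; the 3x3 neighbourhood, the threshold of 0.3, the output stride of 4, and both function names are likewise illustrative.

```python
import numpy as np

def extract_peaks(heatmap, threshold=0.3):
    """Return (row, col, score) for local maxima of a 2-D float heat map."""
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Maximum over each 3x3 neighbourhood, built from nine shifted views.
    neighbourhood_max = np.max(
        [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)],
        axis=0,
    )
    is_peak = (heatmap == neighbourhood_max) & (heatmap >= threshold)
    rows, cols = np.nonzero(is_peak)
    return [(int(r), int(c), float(heatmap[r, c])) for r, c in zip(rows, cols)]

def decode_box(row, col, offset, size, stride=4):
    """Map a peak with offset (dx, dy) and size (w, h) to (x1, y1, x2, y2)."""
    cx, cy = (col + offset[0]) * stride, (row + offset[1]) * stride
    w, h = size[0] * stride, size[1] * stride
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

hm = np.zeros((5, 5))
hm[1, 1], hm[3, 3], hm[3, 4] = 0.9, 0.5, 0.2
peaks = extract_peaks(hm)                       # two peaks survive
box = decode_box(1, 1, offset=(0.5, 0.5), size=(2, 2))
```

The value 0.2 at position (3, 4) is discarded twice over: it is below the threshold and it is not the maximum of its neighbourhood.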

Trained on a large number of richly labeled base-class images and a small number of labeled new-class samples (few-shot), the method improves the detection of new-class test pictures, that is, it raises the mAP and AR scores.

Here AP is the average precision, and mAP is the mean of the AP values over the m sample classes. Precision is defined with respect to the prediction results: it is the proportion of samples predicted as positive that are actually positive, i.e. TP / (TP + FP), where TP, FP and FN denote the numbers of true positives, false positives and false negatives. It measures how accurate the predictions are.

Recall (also known as the true positive rate, TPR) is defined with respect to the original samples: it is the proportion of all actual positive samples that are correctly predicted as positive, i.e. TP / (TP + FN). It measures how many of the positive samples are found.

As shown in fig. 4, the area under the PR curve is the AP value.
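To make the relationship between precision, recall and AP concrete, the sketch below builds a PR curve from scored detections of one class and integrates it. It is a simplified illustration: the official COCO protocol additionally averages over several IoU thresholds and samples the curve at fixed recall points, which is omitted here, and the function names are hypothetical.

```python
def precision_recall_curve(scores, is_true_positive, num_ground_truth):
    """Precision and recall after each detection, in descending score order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    return precisions, recalls

def average_precision(precisions, recalls):
    """Area under the PR curve, using the non-increasing precision envelope."""
    envelope = list(precisions)
    for i in range(len(envelope) - 2, -1, -1):
        envelope[i] = max(envelope[i], envelope[i + 1])
    area, previous_recall = 0.0, 0.0
    for p, r in zip(envelope, recalls):
        area += (r - previous_recall) * p
        previous_recall = r
    return area

def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Four detections of one class, two ground-truth objects.
p, r = precision_recall_curve([0.9, 0.8, 0.7, 0.6], [True, False, True, False], 2)
ap = average_precision(p, r)
```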

The experiments use the COCO2017 [14] benchmark dataset commonly used for target detection, with 118,287 training images and 5,000 validation images covering 80 object classes in total. Of these, the 20 classes that are also covered by the PASCAL VOC2007 [15] dataset serve as the new classes, and the remaining 60 COCO classes serve as the base classes. The experiments are therefore divided into a COCO same-dataset evaluation and a cross-dataset evaluation, in which the 60 base classes of the COCO dataset are used for training and the 20 new classes are tested on the PASCAL VOC dataset.

COCO same-dataset evaluation:

The base-class training images of the COCO dataset are used for the two stages of meta-training. The training images are first resized to 512x512, and supervised training is carried out as in the first stage. In the second stage, owing to GPU memory limits, each episode randomly samples 28 meta-tasks; each meta-task involves the detection of 3 classes with 5 labeled boxes per class. Sampling more tasks benefits performance. For meta-testing, we use multiple randomly sampled support sets, drawn from the COCO training set and containing only samples of the 20 new classes, to extract the class codes. Each support set is injected into the model once and the model is updated only once, with each new class containing only k ∈ {1, 5, 10} labeled boxes. The performance of our small sample target detection model is evaluated using the new-class images of the COCO validation set as test images. We compare our model with several other popular small sample target detection methods: 1) a standard Fine-Tuning detection model based on fast RCNN; 2) Model-Agnostic Meta-Learning (MAML); 3) the non-incremental Few-shot Object Detection via Feature Reweighting; 4) the incremental small sample object detection network ONCE. The experimental results are shown in Table 1. For k ∈ {1, 5, 10}, after new classes are injected into the trained small sample detection models, the AP of our method is higher than that of the other compared methods, and its AP and AR are the best in both the base-class test and the mixed base-class and new-class test. For k = 10, a comparison of the detection results of ONCE and our method is shown in Fig. 5; our method produces fewer false detections and detects difficult objects in the image more accurately.

TABLE 1 Comparison of detection results for the COCO same-dataset evaluation
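The second-stage sampling described above (28 meta-tasks per episode, 3 classes per task, 5 labeled boxes per class) can be sketched as follows. The data layout, a dictionary from class id to a list of annotation ids, and the function names are hypothetical; how the sampled boxes are split into support and query sets is not specified here and is left out.

```python
import random

def sample_meta_task(boxes_by_class, n_way=3, k_shot=5, rng=random):
    """Pick n_way classes and k_shot labeled boxes for each of them."""
    classes = rng.sample(sorted(boxes_by_class), n_way)
    return {c: rng.sample(boxes_by_class[c], k_shot) for c in classes}

def sample_episode(boxes_by_class, tasks_per_episode=28, **task_options):
    """An episode is a list of independently sampled meta-tasks."""
    return [sample_meta_task(boxes_by_class, **task_options)
            for _ in range(tasks_per_episode)]

# Toy stand-in for the 60 base classes, each with 20 annotation ids.
base_boxes = {class_id: list(range(20)) for class_id in range(60)}
episode = sample_episode(base_boxes, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the sampling reproducible; the same functions can draw the k ∈ {1, 5, 10} new-class support sets for meta-testing by changing `k_shot`.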

VOC cross-dataset evaluation:

In the two stages of meta-training, the base-class data of the COCO dataset are again used, and the training procedure is identical to that of the COCO same-dataset evaluation. In the meta-test phase, multiple support sets of the new classes are likewise sampled from the training set of the COCO2017 dataset, and for each support set the model is updated only once, with each new class containing only k ∈ {5, 10} labeled boxes. The difference is that the performance of our small sample target detection model is evaluated using the PASCAL VOC2007 test set images as test images. The experimental comparison results are shown in Table 2. In addition, for the test with ten labeled boxes per new class in each support set (k = 10), a comparison of the detection results of ONCE and our method is given in Fig. 6. The results show that, when five or ten labeled boxes are sampled for each new class from the COCO training set and the test is performed on the PASCAL VOC test set, both the AP and AR values are better than those obtained on the COCO dataset. This indicates that, in the small sample target detection task, detecting images of the COCO dataset is more challenging than detecting images of the PASCAL VOC dataset. It also shows that our framework can be easily migrated to other datasets for detection, which is of great significance for related work on small sample target detection.

TABLE 2 Comparison of detection results for the VOC cross-dataset evaluation

The method introduces the idea of supervised contrastive learning into the small sample target detection model. Adding the contrastive learning branch improves the quality of the model's class-specific codes, which to a certain extent addresses the model's insufficient generalization to new classes. The training strategy of the invention does not construct a balanced small sample set of base classes and new classes for meta-training; instead, a small number of new-class samples that the model has never seen are injected directly in the meta-testing stage for detection. This setting is extremely challenging: a newly injected detection category is easily misjudged as a base class, which places heavy demands on the generalization performance of the model. It is nevertheless a key problem in small sample target detection research, and in this way the method can readily introduce new-class samples for effective detection, which favors future deployment in specific application scenarios. In the experimental part, the method is evaluated with the main target detection metrics AP and AR and achieves good detection results, alleviating to a certain extent the insufficient generalization of existing detection models to new classes. Moreover, contrastive learning benefits from a larger number of contrast samples; with more GPU memory, the invention can sample more meta-tasks per episode and further improve performance.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

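1. An anchor-free incremental target detection method, characterized by comprising the following steps:

step 1, selecting a target detection model;

step 2, constructing a small sample target detection model based on the target detection model in step 1;

step 3, performing meta-training on the small sample target detection model;

step 4, performing meta-testing on the trained small sample target detection model.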

2. The anchor-free incremental target detection method according to claim 1, wherein in step 1, the specific steps of constructing the target detection model comprise:

3. The anchor-free incremental target detection method according to claim 2, wherein in step 2, the specific steps of constructing the small sample target detection model are as follows:

4. The anchor-free incremental target detection method according to claim 3, wherein in step 3, the specific steps of meta-training the small sample target detection model comprise:

5. The anchor-free incremental target detection method according to claim 4, wherein in step 4, the specific steps of meta-testing the trained small sample target detection model are as follows:

6. The anchor-free incremental target detection method according to claim 3, wherein in step 2.2, the specific steps of training the contrastive learning branch on the class codes generated by the class-specific code generator are as follows:

7. The anchor-free incremental target detection method according to claim 6, wherein in step 2.2.2, assuming that there are two class codes X_i and X_j with the same label, the loss function for supervised contrastive learning of the class-specific codes is:

In formula (1),

is the loss function value calculated for a single sample, and L_CCE is the average loss function value for one meta-task;

in formula (2),

is the 128-dimensional normalized class feature obtained by processing the class code X_i through a multilayer perceptron; the numerator inside the log function is the feature distance between X_i and X_j, and the denominator is the feature distance between X_i and all data in each meta-task, including positive and negative examples.

8. The anchor-free incremental target detection method according to claim 1, wherein in step 2.2, the class-specific code generator outputs the class-dependent convolution kernel weights C_k by means of global average pooling, to further parameterize the weight parameters of the target locator.