CN113822368A - Anchor-free incremental target detection method - Google Patents


Info

Publication number
CN113822368A
CN113822368A (application CN202111153974.XA)
Authority
CN
China
Prior art keywords
class
target detection
target
specific
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111153974.XA
Other languages
Chinese (zh)
Other versions
CN113822368B (en)
Inventor
符颖
林弟忠
胡金蓉
文武
邹书蓉
周激流
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology filed Critical Chengdu University of Information Technology
Priority to CN202111153974.XA priority Critical patent/CN113822368B/en
Publication of CN113822368A publication Critical patent/CN113822368A/en
Application granted granted Critical
Publication of CN113822368B publication Critical patent/CN113822368B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of image recognition, and particularly discloses an anchor-free incremental target detection method, which comprises the following steps: step 1, selecting a target detection model; step 2, constructing a small sample target detection model based on the target detection model of step 1; step 3, performing meta-training on the small sample target detection model; and step 4, performing meta-testing on the trained small sample target detection model. According to the invention, under training with a large amount of base class data (images) containing rich labels and a small number of labeled new-class small samples (few-shot), the detection effect on new-class test pictures is improved, i.e., the mAP and AR scores are raised.

Description

Anchor-free incremental target detection method
Technical Field
The invention relates to the field of image recognition, in particular to an anchor-free incremental target detection method.
Background
Against the background of the gradual maturation of high-performance parallel computing and the rapid development of neural networks, target detection techniques based on deep learning have quickly replaced methods based on hand-crafted feature extraction. The main task of target detection is to locate objects of interest in the input image and then accurately judge the category of each object of interest. At present, mature target detection algorithms have been successfully deployed in practical application scenarios such as video surveillance, autonomous driving and traffic-scene detection by relying on large-scale labeled data, but a large amount of labeled data is required. Affected by the insufficient amount of labeled data, the range of practical application scenarios is not wide enough and the detection tasks that can be carried out are limited. Data labeling is very costly and labor-intensive, and obtaining large-scale labeled data is impractical in most practical application scenarios, which greatly limits the application of existing target detection algorithms to more practical scenes.
Based on this, how to learn a target detection model with a certain generalization capability from only a few labeled data is an urgent research problem. In recent years, many researchers have focused on detection under the scenario of a small amount of labeled data, namely small sample (few-shot) target detection. Most current research models are based on traditional target detection frameworks such as Faster-RCNN, YOLO and SSD; drawing on the meta-training strategy of small sample learning, a small number of labeled, previously unseen new classes are injected into a detection model after extensive base-class training, and the two tasks of classification and regression for the new classes are completed, which is very challenging.
In 2019, Fan et al. introduced an attention-RPN module into the candidate-box region proposal network to fuse the features of the query image and the support-set images, proposed a multi-relation detector that learns feature relations at the local, global and cross-correlation levels, and adopted a two-way contrastive training strategy for similarity matching to detect new classes. In 2020, Wang et al. [6] proposed using Faster-RCNN as the framework and training in two stages, fine-tuning only the classification and regression sub-networks in the second stage and re-adjusting the combined feature weights to adapt to novel classes. In 2020, Juan-Manuel et al. [7] proposed, drawing on the CenterNet framework, a feature extractor for image feature extraction and a target locator for target localization, and introduced a ResNet-50 network to extract the weights corresponding to the image outputs of each category, using these weights to complete the detection of new classes. In 2021, Bo Sun et al. [9] introduced the feature pyramid model and contrastive learning into the model proposed by Wang et al., and the detection performance was improved by 2.7% on the COCO benchmark data set and by 8.8% on the standard PASCAL VOC data set. This kind of incremental detection method detects directly in a conventional inference manner, can easily introduce new classes, is very efficient, has low requirements on new-class data, and has obvious advantages in performance compared with existing methods. However, the ONCE network of Juan-Manuel et al., which draws on the CenterNet target detection framework, uses only a ResNet-50 network to extract class-specific codes; judging from the experimental results this is not an optimal scheme, and its ability to extract features of new classes is insufficient.
Disclosure of Invention
In order to solve the above problems, the invention provides an anchor-free incremental target detection method, which can better extract class-specific feature information and optimize the detection result.
The invention is realized by the following technical scheme:
an anchor-free incremental target detection method comprises the following steps:
step 1, selecting a target detection model;
step 2, constructing a small sample target detection model based on the target detection model in the step 1;
step 3, performing meta-training on the small sample target detection model;
and 4, performing meta-testing on the trained small sample target detection model.
As an optimization, in step 1, the specific steps of constructing the target detection model include:
step 1.1, selecting a CenterNet detection network as a target detection model of a base class network.
As an optimization, in step 2, the specific steps of constructing a small sample target detection model are as follows:
step 2.1, regarding the CenterNet detection network as consisting of a feature extractor and a target locator, wherein the feature extractor adopts a ResNet residual network as the encoder and a deconvolution network as the decoder, and all new classes and base classes share weights; the target locator contains the convolution kernel weights of each individual class to be detected, and it analyses the 3D feature map output by the feature extractor using class-specific convolution kernels to generate the detection results of the input samples in the form of heatmaps;
step 2.2, introducing a class-specific code generator, wherein the class-specific code generator is provided with a class encoder with the same structure as the encoder of the feature extractor and is used for generating the convolution kernel weights $C_k$, and the generated convolution kernel weights $C_k$ are used to parameterize the target locator; the class codes generated by the class-specific code generator are trained through a contrastive learning branch so as to improve the consistency among codes of the same class and enlarge the difference between codes of different classes.
As an optimization, in step 3, the meta-training of the small sample target detection model specifically includes:
step 3.1, training a class feature extractor on a CenterNet detection network by using a base class data set, wherein the class feature extractor is used for feature extraction of new class data;
step 3.2, dividing the labeled base class data set into support set and query set images, inputting the query set into the feature extractor and the support set into the class-specific code generator, so that the feature extractor extracts features from the base class data set and the class-specific code generator generates class codes related to the base class data set;
and 3.3, performing combined training on the class specific code generator and the target locator, so that the target locator performs positioning learning on the new class data by combining the class codes and the extracted features.
As an optimization, in step 4, the specific steps of performing meta-testing on the trained small sample target detection model are as follows:
step 4.1, inputting a small amount of new class data with labels to a trained class-specific encoder to generate a weight parameter of a specific class, and parameterizing a target locator;
step 4.2, the class feature extractor extracts the features of the new class data and outputs a feature map,
and 4.3, completing the detection of the target in the test image by the parameterized target positioner.
As an optimization, in step 2.2, the specific step of performing the comparative learning branch training on the class code generated by the class-specific code generator is as follows:
2.2.1, converting the characteristics of class codes into 128-dimensional contrast characteristics by applying a layer of multilayer perceptron;
and 2.2.2, measuring the similarity between different classes of codes on the normalized class characteristics of the multilayer perceptron codes, and optimizing by using a loss function of the specific codes of the supervised contrast learning class so as to improve the intra-class similarity and the inter-class difference.
As an optimization, in step 2.2.2, assuming there are two class codes $X_i$ and $X_j$ with the same label, the loss function for supervised contrastive learning of class-specific codes is:

$$L_{CCE}=\frac{1}{N}\sum_{i=1}^{N}L_{i} \quad (1)$$

$$L_{i}=\frac{-1}{N_{y_i}-1}\sum_{j=1}^{N}\mathbb{1}_{j\neq i}\,\mathbb{1}_{y_j=y_i}\log\frac{\exp(z_i\cdot z_j/\tau)}{\sum_{k=1}^{N}\mathbb{1}_{k\neq i}\exp(z_i\cdot z_k/\tau)} \quad (2)$$

In formula (1), $L_i$ is the loss function value computed for a single sample and $L_{CCE}$ is the average loss function value of one meta-task; in formula (2), $N_{y_i}$ represents the number of samples belonging to the same class in a meta-task, $\mathbb{1}_{k\neq i}$ is the indicator function, taking 0 if and only if k = i and 1 otherwise, τ is the temperature parameter being optimized, and $z_i$ is the 128-dimensional normalized class feature obtained by passing class code $X_i$ through the multilayer perceptron; the numerator inside the log function is the representation similarity (inner product) between $X_i$ and $X_j$, and the denominator is the similarity between $X_i$ and all data (positive and negative) in each meta-task.
As an optimization, in step 2.2, the class-specific code generator outputs the class-dependent convolution kernel weights $C_k$ by way of global average pooling to further parameterize the weight parameters of the target locator.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. According to the invention, under training with a large amount of base class data (images) containing rich labels and a small number of labeled new-class small samples (few-shot), the detection effect on new-class test pictures is improved, i.e., the mAP and AR scores are raised.
2. The method of the invention has less misjudgment and more accurately detects the difficult target in the image.
3. The method can be easily migrated to other data sets for detection, and has important significance for the related work of small sample target detection.
Drawings
In order to more clearly illustrate the technical solutions of the exemplary embodiments of the present invention, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and that for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort. In the drawings:
FIG. 1 is a schematic diagram of the CenterNet network structure in the anchor-free incremental target detection method according to the present invention;
FIG. 2 is a schematic diagram of the network structure of the small sample target detection model in the anchor-free incremental target detection method according to the present invention;
FIG. 3 is a schematic diagram of the class-code contrastive learning branch in the anchor-free incremental target detection method according to the present invention;
FIG. 4 is a schematic diagram of Precision and Recall curves;
FIGS. 5 and 6 are graphs comparing the detection effects of ONCE (upper) and the anchor-free incremental target detection method according to the present invention (lower).
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Research shows that when a small amount of new-class data (images) is injected into a small sample target detection model whose training did not mix in the base-class data, great interference occurs: the new-class data are easily misjudged as the trained base classes, which seriously affects the detection result. Nevertheless, injecting never-seen new-class data samples into a model trained on a large number of base classes for detection is one of the important tasks in the field of small sample target detection, because it makes it easier to inject new classes with only a small amount of labeled data for detection.
In this research, the data set is divided into a base class set and a new class set with no category overlap at all: base-class samples with rich label information serve as feature guidance, and a small number of labeled new classes are used to test the performance of the model. After the constructed model has been trained on the base classes, it can be effectively transferred to unseen new-class samples for detection, without constructing a balanced small sample set of base and new classes for retraining. Such a detection setting is very challenging, but it allows new sample classes to be injected easily for detection. Small sample sets of the new classes can be found in Xin Wang, Thomas E. Huang, Trevor Darrell, Joseph E. Gonzalez, and Fisher Yu. Frustratingly Simple Few-Shot Object Detection. ICML, 2020.
In the anchor-free incremental target detection method of the invention, target detection is divided into anchor-free and anchor-based approaches: anchor-based detectors rely on anchor boxes predefined by the network, which involves complex computation, whereas anchor-free detectors generate the detection boxes directly in the later stage of the network. "Incremental", as mentioned in this application, means that new data can easily be introduced and added directly. That is, the small sample target detection method based on contrastive-learning class-specific codes comprises the following steps. Step 1, selecting a target detection model. In order to construct an effective detection network that can be adapted to detect new-class samples, a suitable base-class detection network must first be selected; the choice is not arbitrary, because efficient and fast detection on a small number of new-class samples must be achieved. In recent years Faster-RCNN has commonly been used as the base-class detection network, but it adopts a two-stage design and softmax-based classification, and the interaction between classes makes independent detection of new classes inflexible. Therefore, the CenterNet detection network, an anchor-free target detection algorithm capable of fast and efficient detection, is selected as the base-class network; compared with networks such as YOLO and SSD, CenterNet achieves a balance between detection speed and accuracy and makes it easier to build a class-specific representation extraction module.
The design concept of the CenterNet detection network comes from keypoint detection: the center point is found by keypoint estimation, and target attributes such as the height and width of the target bounding box are obtained by regression, so that post-processing such as keypoint grouping and non-maximum suppression is not needed; the network structure is shown in FIG. 1. CenterNet feeds the training images into a fully convolutional encoder-decoder network to obtain multiple heatmaps, one heatmap per category. The peak points of the 2D heatmaps are the center points, and the position of a peak point also predicts the height and width of the target; the details of detection are described later. This keypoint-based target detection framework not only eliminates region candidate boxes but also generates a prediction heatmap unique to each class and performs independent detection via activation thresholds. The framework is therefore very suitable for small sample target detection work in which new classes are injected for detection, since the base classes and new classes do not interfere with each other.
And 2, constructing a small sample target detection model based on the target detection model in the step 1.
In order to make the detection model output corresponding weight parameters for each class of image and independently extract effective features of new classes, the structure of the target detection model, the CenterNet network, is adjusted. Instead of obtaining the weight parameters through the original end-to-end training, a meta-learning training strategy is introduced and a new small sample target detection model is designed. The network structure is shown in FIG. 2.
We regard the CenterNet network as consisting of a feature extractor and a target locator, where the feature extraction network uses a ResNet residual network as the encoder and a deconvolution network as the decoder, and all new classes and base classes share weights. The target locator contains the specific weight parameters of each individual class to be detected; it analyses the 3D feature map output by the feature extractor using class-specific convolution kernels and generates the detection results of the input samples in the form of heatmaps.
Secondly, a class-specific code generator is introduced to generate the weights of the convolution kernels in the target locator, replacing the original way of updating the weights through iterative training. The class-specific code generator adopts the same encoder structure as the feature extraction network but is not connected to the deconvolution network; it outputs the class-dependent weight parameters (convolution kernel weights) $C_k$ by global average pooling, which further parameterize the target locator, i.e., the weight parameters $C_k$ produced by the class-specific code generator network synthesize the parameters of the target locator network (such as the class-dependent convolution kernel weight parameters). In addition, a contrastive learning branch is connected after the class-specific encoder to guide the class-specific code generator to learn contrast-aware class-specific codes, computing the feature similarity of the class-specific codes and better modeling intra-class similarity and inter-class difference.
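For illustration only, the following minimal PyTorch sketch shows how such a class-specific code generator could produce 1×1 convolution kernel weights from support images and how a shared feature map could then be convolved with them; the module names, the ResNet-18 backbone and the 256-channel feature size are assumptions of this sketch, not the patented implementation.

```python
# Hypothetical sketch: a target locator whose 1x1 convolution weights are
# generated from support images instead of being learned by backpropagation.
# Names, channel sizes and the backbone choice are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ClassCodeGenerator(nn.Module):
    """Encodes support images of one class into a 1x1 conv weight (class code)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        self.project = nn.Conv2d(512, feat_dim, kernel_size=1)

    def forward(self, support_imgs):              # (n_shot, 3, H, W)
        f = self.project(self.encoder(support_imgs))
        code = f.mean(dim=(0, 2, 3))               # global average pool + shot average
        return code                                 # (feat_dim,)

def locate(query_feats, class_codes):
    """Convolve shared query features with generated per-class 1x1 kernels.

    query_feats: (B, C, H/R, W/R) feature map from the shared feature extractor.
    class_codes: dict {class_name: (C,) tensor} produced by ClassCodeGenerator.
    Returns one heatmap per class, detected independently via a sigmoid.
    """
    weight = torch.stack(list(class_codes.values()))         # (K, C)
    weight = weight.view(weight.size(0), -1, 1, 1)            # (K, C, 1, 1)
    heatmaps = torch.sigmoid(F.conv2d(query_feats, weight))   # (B, K, H/R, W/R)
    return heatmaps
```

In such an arrangement the locator has no trainable per-class weights of its own; adding a class only means adding one more generated kernel, which is the property the incremental setting relies on.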
Specifically, the specific step of performing the comparative learning branch training on the class code generated by the class specific code generator is as follows:
step 2.2.1, applying a layer of multilayer perceptron (1-layermulti-layer-perceptron (MLP) -head) to convert the characteristics of class codes into 128-dimensional contrast characteristics;
and 2.2.2, measuring the similarity between different classes of codes on the normalized class characteristics of the multilayer perceptron codes, and optimizing by using a loss function of the specific codes of the supervised contrast learning class so as to improve the intra-class similarity and the inter-class difference.
The loss function for supervised contrastive learning of class-specific codes is formulated with inspiration from related work on self-supervised and supervised contrastive learning.
In the supervised setting, the class-code features of the same class all form positive pairs, while class-code features of different classes form negative pairs. To pull the class codes $X_i$ and $X_j$ of the same label closer together and push class codes of different labels further apart, a loss function for supervised contrastive learning of class-specific codes is designed, as shown in formulas (1) and (2); the similarity between class-code features is measured by inner product. Multiple meta-tasks are sampled during training, and the support-set part of each meta-task samples N examples from the given data, denoted $\{x_k,y_k\}_{k=1,\dots,N}$, where $y_k$ is the label of $x_k$.

$$L_{CCE}=\frac{1}{N}\sum_{i=1}^{N}L_{i} \quad (1)$$

$$L_{i}=\frac{-1}{N_{y_i}-1}\sum_{j=1}^{N}\mathbb{1}_{j\neq i}\,\mathbb{1}_{y_j=y_i}\log\frac{\exp(z_i\cdot z_j/\tau)}{\sum_{k=1}^{N}\mathbb{1}_{k\neq i}\exp(z_i\cdot z_k/\tau)} \quad (2)$$

In formula (1), $L_i$ is the loss computed for a single sample and $L_{CCE}$ is the average loss of one meta-task. In formula (2), $N_{y_i}$ denotes the number of samples belonging to the same class in a meta-task, $\mathbb{1}_{k\neq i}$ is the indicator function, taking 0 if and only if k = i and 1 otherwise, τ is the temperature parameter being optimized [12], and $z_i$ is the 128-dimensional normalized class feature obtained by passing class code $X_i$ through the multilayer perceptron (MLP head). The numerator inside the log function is the representation similarity between $X_i$ and $X_j$, and the denominator is the similarity between $X_i$ and all data (positive and negative) in each meta-task. This design well characterizes intra-class similarity and inter-class difference and improves the encoding performance of the class-specific encoder, whose effectiveness is demonstrated in the later experiments.
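For concreteness, a minimal sketch of a loss with the structure of formulas (1) and (2) is given below; the function name, the tensor layout (one (N, 128) matrix of MLP-head features per meta-task with integer labels) and the default temperature are illustrative assumptions.

```python
# Minimal sketch of a supervised contrastive loss over class codes,
# following the structure of formulas (1) and (2). Tensor layout and
# the projection to 128-d features are assumptions for illustration.
import torch
import torch.nn.functional as F

def supervised_contrastive_class_code_loss(z, labels, tau=0.07):
    """z: (N, 128) MLP-head features of the class codes in one meta-task.
    labels: (N,) integer class labels. Returns the average loss L_CCE."""
    z = F.normalize(z, dim=1)                           # normalized class features
    sim = z @ z.t() / tau                               # inner-product similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    # denominator: all other samples (positive and negative) in the meta-task
    exp_sim = torch.exp(sim).masked_fill(self_mask, 0.0)
    log_denom = torch.log(exp_sim.sum(dim=1, keepdim=True))
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # per-sample loss: average over positives of -log(exp(sim)/denominator)
    log_prob = sim - log_denom
    n_pos = pos_mask.sum(dim=1).clamp(min=1)
    loss_per_sample = -(log_prob * pos_mask).sum(dim=1) / n_pos
    return loss_per_sample.mean()                       # L_CCE over the meta-task
```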
And 3, performing meta-training on the small sample target detection model.
By using the meta-learning training strategy, and in order to make full use of the base categories, the meta-training is divided into two serial stages. In the first stage, a class feature extractor is trained on the CenterNet detection network structure with a large amount of base class data and is later used for feature extraction of the small amount of new-class data. The second-stage training is divided into a number of episodes (the term episode comes from Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al., Matching Networks for One Shot Learning, NeurIPS, 2016; see also T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, Focal Loss for Dense Object Detection, ICCV, 2017). The labeled training data are divided into query and support set images, which are input into the feature extractor and the class-specific code generator of the small sample target detection model, respectively, and a strategy of jointly training the class-specific encoder and the target locator is adopted, wherein the target locator performs localization learning of the small sample targets by combining the class codes with the extracted features.
In this training mode, each episode samples several meta-tasks containing different category combinations. This mechanism lets the model learn the parts that are common across different meta-tasks, such as how to extract important features and compare the similarity of samples, while ignoring the task-specific parts within each meta-task. A model learned with this mechanism classifies better when facing new, unseen meta-tasks and is more conducive to learning class codes.
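A minimal sketch of this episodic sampling scheme is shown below; the dataset interface (images_by_class) and the function names are assumptions, while the 3-way / 5-shot sizes and the 28 meta-tasks per episode mirror the settings described in the experiments.

```python
# Illustrative sketch of episodic meta-task sampling: each meta-task draws a
# label set L, then splits the labelled images of those classes into a support
# set S and a query set Q. Dataset access (images_by_class) is an assumption.
import random

def sample_meta_task(images_by_class, n_way=3, k_shot=5, n_query=5):
    """images_by_class: dict {class_name: [image_id, ...]} of base-class data."""
    label_set = random.sample(list(images_by_class.keys()), n_way)   # L
    support, query = {}, {}
    for cls in label_set:
        imgs = random.sample(images_by_class[cls], k_shot + n_query)
        support[cls] = imgs[:k_shot]           # S: used by the class code generator
        query[cls] = imgs[k_shot:]             # Q: used by the feature extractor
    return support, query

def sample_episode(images_by_class, n_meta_tasks=28):
    """One episode = a fixed number of meta-tasks with different class combos."""
    return [sample_meta_task(images_by_class) for _ in range(n_meta_tasks)]
```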
Training in the first stage: learning of feature extractors
On the base class data set, random flipping, random scaling, cropping and color jittering are used as the data augmentation $T_1$, and standard supervised training is performed in a manner similar to CenterNet. Given a training image $I\in\mathbb{R}^{h\times w\times 3}$ of height h and width w, the feature extractor $f(\cdot)$ based on a ResNet residual network extracts a feature map $m=f(I)$, $m\in\mathbb{R}^{\frac{w}{R}\times\frac{h}{R}\times c}$, where R is the down-sampling factor and m contains c channels corresponding to the c classes of target objects. Then, in the target locator h, the learned base-class convolution kernels $c_k\in\mathbb{R}^{1\times 1\times c}$ are convolved with the feature map m to obtain the heatmap of each class, $\hat{Y}\in[0,1]^{\frac{w}{R}\times\frac{h}{R}\times(c+4)}$, where $\hat{Y}_{xyc}=1$ represents a detected keypoint, i.e. a class-c target object is detected at coordinate (x, y), while $\hat{Y}_{xyc}=0$ means that there is no class-c object at the current coordinate point and it is regarded as background.
As shown in formula (3), $\hat{Y}$ has c + 4 channels; the remaining four channels correspond to the prediction of the center-point offset of the bounding box and the prediction of the bounding-box width and height. $\otimes$ denotes the convolution operation and $K_b$ is the number of base classes.

$$\hat{Y}=m\otimes c_k,\quad k=1,\dots,K_b \quad (3)$$
In training, for each ground-truth keypoint $p\in\mathbb{R}^2$, the down-sampled coordinate used for training is set to $\tilde{p}=\lfloor p/R\rfloor$. On the c down-sampled channel feature maps, each ground-truth keypoint is distributed onto the heatmap $Y\in[0,1]^{\frac{w}{R}\times\frac{h}{R}\times c}$ through a Gaussian kernel $Y_{xyc}$. $Y_{xyc}$ is computed as shown in formula (4), where $\sigma_p$ is an adaptive standard deviation related to the width and height of the target. If the Gaussian distributions generated by different objects of the same class overlap, the larger value is taken within the overlapping range.

$$Y_{xyc}=\exp\left(-\frac{(x-\tilde{p}_x)^2+(y-\tilde{p}_y)^2}{2\sigma_p^2}\right) \quad (4)$$
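A sketch of how the ground-truth heatmaps of formula (4) could be constructed is shown below; the concrete rule for the adaptive standard deviation σ_p (here simply derived from the box size) is an assumption, since the text only states that it is related to the target width and height.

```python
# Sketch of splatting ground-truth keypoints onto a class heatmap with the
# Gaussian of formula (4); sigma here is derived from the box size, an
# assumption standing in for the "adaptive standard deviation".
import numpy as np

def draw_gaussian(heatmap, center, sigma):
    """heatmap: (H/R, W/R) array for one class; center: down-sampled (cx, cy)."""
    h, w = heatmap.shape
    cx, cy = int(center[0]), int(center[1])
    radius = max(1, int(3 * sigma))
    xs = np.arange(max(0, cx - radius), min(w, cx + radius + 1))
    ys = np.arange(max(0, cy - radius), min(h, cy + radius + 1))
    gx, gy = np.meshgrid(xs, ys)
    g = np.exp(-((gx - cx) ** 2 + (gy - cy) ** 2) / (2 * sigma ** 2))
    # overlapping Gaussians of the same class keep the larger value
    region = heatmap[ys[0]:ys[-1] + 1, xs[0]:xs[-1] + 1]
    np.maximum(region, g, out=region)
    return heatmap

def make_target(shape, objects, down_ratio=4):
    """objects: list of (class_id, cx, cy, w, h) in input-image coordinates."""
    heatmaps = np.zeros(shape, dtype=np.float32)        # (num_classes, H/R, W/R)
    for cls, cx, cy, w, h in objects:
        sigma = max(w, h) / down_ratio / 6.0             # assumed size-based sigma
        draw_gaussian(heatmaps[cls], (cx / down_ratio, cy / down_ratio), sigma)
    return heatmaps
```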
The keypoint heatmap is trained with a loss function that rewrites the Focal loss [20], as shown in equation (5),

$$L_{k}=\frac{-1}{N}\sum_{xyc}\begin{cases}\left(1-\hat{Y}_{xyc}\right)^{\alpha}\log\left(\hat{Y}_{xyc}\right) & \text{if } Y_{xyc}=1\\ \left(1-Y_{xyc}\right)^{\beta}\left(\hat{Y}_{xyc}\right)^{\alpha}\log\left(1-\hat{Y}_{xyc}\right) & \text{otherwise}\end{cases} \quad (5)$$

where α and β are hyper-parameters of the Focal loss, set to α = 2 and β = 4 in the experiments, and N is the number of keypoints in image I; see [13] for more details.
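A compact PyTorch rendering of the penalty-reduced focal loss of equation (5) might look as follows; the clamping epsilon is an implementation assumption.

```python
# Sketch of the keypoint loss of equation (5): a focal loss whose negative term
# is down-weighted by (1 - Y)^beta around the Gaussian-encoded ground truth.
import torch

def centernet_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """pred, gt: (B, C, H/R, W/R); gt holds the Gaussian heatmaps of formula (4)."""
    pred = pred.clamp(eps, 1 - eps)
    pos = gt.eq(1).float()                      # exact ground-truth peaks
    neg = 1.0 - pos
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    num_pos = pos.sum().clamp(min=1)            # N: number of keypoints in the image
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```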
The down-sampled feature map is remapped back to the original image with a loss of accuracy, so an offset needs to be predicted for each center point, i.e., $\hat{O}\in\mathbb{R}^{\frac{w}{R}\times\frac{h}{R}\times 2}$. The center points of all classes c share the same offset prediction, trained with an L1 loss as shown in equation (6),

$$L_{off}=\frac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}}-\left(\frac{p}{R}-\tilde{p}\right)\right| \quad (6)$$

where $\hat{O}_{\tilde{p}}$ is the predicted offset and $\frac{p}{R}-\tilde{p}$ is a value computed in advance during training.
Suppose $\left(x_1^{(k)},y_1^{(k)},x_2^{(k)},y_2^{(k)}\right)$ is the bounding box of target k of class $c_k$; its center point is $p_k=\left(\frac{x_1^{(k)}+x_2^{(k)}}{2},\ \frac{y_1^{(k)}+y_2^{(k)}}{2}\right)$, and the size of each target k is regressed to $s_k=\left(x_2^{(k)}-x_1^{(k)},\ y_2^{(k)}-y_1^{(k)}\right)$, where $s_k$ is also computed in advance as the width and height after ground-truth down-sampling. The keypoint prediction network $\hat{Y}$ is used to predict all center points, and $\hat{S}\in\mathbb{R}^{\frac{w}{R}\times\frac{h}{R}\times 2}$ is used as the predicted width and height for each point in the heatmap, also trained with the L1 regression loss, as shown in equation (7):

$$L_{size}=\frac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{p_k}-s_k\right| \quad (7)$$
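The offset and size regressions of equations (6) and (7) reduce to L1 losses evaluated only at the ground-truth center locations; a rough sketch is given below, where the tensor layout and function name are assumptions.

```python
# Sketch of the L1 regression losses of equations (6) and (7): predictions are
# read out only at ground-truth center locations and compared with the
# precomputed offsets / sizes. Index layout is an illustrative assumption.
import torch
import torch.nn.functional as F

def l1_regression_loss(pred_map, target, centers):
    """pred_map: (B, 2, H/R, W/R) offset or size head.
    target:  (B, K, 2) precomputed ground-truth offsets or sizes.
    centers: (B, K, 2) integer (x, y) center coordinates on the down-sampled map."""
    b, k, _ = centers.shape
    losses = []
    for i in range(b):
        xs, ys = centers[i, :, 0], centers[i, :, 1]
        pred = pred_map[i, :, ys, xs].t()        # (K, 2) values at the centers
        losses.append(F.l1_loss(pred, target[i], reduction="sum"))
    n = max(b * k, 1)
    return torch.stack(losses).sum() / n
```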
At prediction time, the local peak points $\{(\hat{x}_i,\hat{y}_i)\}$ of each of the c classes are extracted from the heatmaps; these peaks are activation values greater than or equal to those of their 8-connected neighbourhood, obtained with 3×3 max pooling, and the first 100 points are kept, where the peak values are the probability values of the class-c target center points. According to the preset threshold, the center points among the selected 100 peak points whose values exceed the threshold are taken as the final result, and $\hat{Y}_{\hat{x}_i\hat{y}_i c}$ is used as the confidence of the current point while the target box is regressed. The predicted target box is shown in equation (8):

$$\left(\hat{x}_i+\delta\hat{x}_i-\frac{\hat{w}_i}{2},\ \hat{y}_i+\delta\hat{y}_i-\frac{\hat{h}_i}{2},\ \hat{x}_i+\delta\hat{x}_i+\frac{\hat{w}_i}{2},\ \hat{y}_i+\delta\hat{y}_i+\frac{\hat{h}_i}{2}\right) \quad (8)$$

where $(\delta\hat{x}_i,\delta\hat{y}_i)$ is the prediction of the center-point offset and $(\hat{w}_i,\hat{h}_i)$ is the size prediction of the bounding box. Both the offset and the bounding-box size are trained with the L1 regression loss, as shown in equations (6) and (7).
The total loss of this training stage consists of three parts, heatmap, offset and size, and is used to optimize the parameters of the feature extractor f and the class-specific kernel parameters $c_k$ of the locator, as shown in formula (9); $\lambda_{size}=0.1$ and $\lambda_{off}=1$ are set in the experiments.

$$L_{det}=L_{k}+\lambda_{size}L_{size}+\lambda_{off}L_{off} \quad (9)$$

The goal of this stage is simply to learn a feature extractor; the target locator here is a conventional CenterNet locator and is discarded in the second stage, where only the trained feature extractor $f(\cdot)$ is used.
Second-stage training: class-specific encoder learning
The class-dependent convolution kernel parameters obtained in the previous stage are fixed parameters for the base classes only. In the second stage, the parameters of the feature extractor are frozen, and the class-specific encoder connected to the contrastive learning branch is mainly trained, so that it can efficiently synthesize codes for a new class from a small number of labeled samples. To train the class-specific encoder efficiently, we adopt an episodic meta-learning training strategy [13].
The specific method is as follows: the whole training data is divided into a number of episodes, each containing a specified number of meta-tasks. Each meta-task samples a class label set L from all classes (for example, L = {banana, umbrella, ...}); according to L, a support set S and a query set Q of training samples are selected for each meta-task, where S and Q are obtained by randomly assigning the images of each class in the label set. Random horizontal flipping, random cropping and color jittering are used as the data augmentation $T_2$ for support-set images; the image x in each support set undergoes two data augmentations to generate samples $x_a$ and $x_b$ as a basic positive pair, which are fed into the class-specific encoder to extract features. The class-specific encoder is initialized with the encoder weights of the feature extractor trained in the first stage. During forward propagation, for each meta-task, the query-set images are augmented with $T_1$ and their features are extracted by the feature extractor trained in the first stage, yielding a feature map of c channels according to formula (10). At the same time, the class-specific encoder, denoted $g(\cdot)$, uses the $T_2$-processed support-set images to generate the corresponding weight parameters $\tilde{c}_k$, as shown in formula (11).

$$m_Q=f(I),\quad I\in Q \quad (10)$$

$$\tilde{c}_k=g\left(T_2(s_k)\right),\quad s_k\in S_k \quad (11)$$

Here $m_Q$ is the feature obtained by the feature extractor for query-set image I, and $s_k\in S_k$ is a support-set sample of class k. Finally, a 3×256-dimensional class code $\tilde{c}_k$ is output through global average pooling.
On the one hand, the codes $\tilde{c}_k$ generated from all samples of the support set S are fed into the contrastive learning branch and encoded by the MLP head into normalized contrastive features, then the similarity scores between codes of different classes are computed and the objective function is optimized through back-propagation. On the other hand, to facilitate comparison with other detection methods, only the class codes obtained from the $x_a$ samples are taken, and the class codes of the same class are average-pooled to obtain $\bar{c}_k$. The image features $m_Q$ of the query set and these class codes $\bar{c}_k$ of the same categories are input into the target locator for the convolution operation to generate the heatmap $\hat{Y}_Q$, completing target detection for the query-set images, as shown in equation (12).

$$\hat{Y}_Q=m_Q\otimes\bar{c}_k \quad (12)$$
Similar to the first-stage CenterNet, we use the L1 loss, as shown in equation (13); the mean absolute prediction error over Q is minimized by updating the parameters of the class code generator and the target locator, where n is the number of images in the query set and Z is the ground-truth heatmap.

$$L_Q=\frac{1}{n}\sum_{I\in Q}\left|\hat{Y}_Q-Z\right| \quad (13)$$
The total loss of the training phase is composed of four parts of Heatmap, offset, size, and contrast loss, as shown in equation (14):
$$L_{meta\text{-}det}=L_Q+\lambda_{size}L_{size}+\lambda_{off}L_{off}+L_{CCE} \quad (14)$$
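Putting the second stage together, a simplified sketch of one meta-task forward pass following equations (10)–(14) is given below; it reuses the hypothetical modules and the contrastive loss function from the earlier sketches and omits the offset and size terms for brevity.

```python
# Simplified sketch of one second-stage meta-task step, following equations
# (10)-(13): support images -> class codes, query features -> heatmaps, and an
# L1 loss against ground-truth heatmaps. Module names are illustrative.
import torch
import torch.nn.functional as F

def meta_task_step(feature_extractor, code_generator, mlp_head,
                   support_views, query_images, query_gt_heatmaps, tau=0.07):
    """support_views: dict {class_id: (imgs_a, imgs_b)} - two T2 augmentations of
    the same support images, forming the positive pairs of the contrastive branch."""
    # eq (10): frozen feature extractor encodes the query images
    with torch.no_grad():
        m_q = feature_extractor(query_images)                  # (B, C, H/R, W/R)

    codes_a, codes_b, labels = [], [], []
    for cls_id, (imgs_a, imgs_b) in support_views.items():      # eq (11)
        codes_a.append(code_generator(imgs_a))                   # (C,)
        codes_b.append(code_generator(imgs_b))
        labels.append(cls_id)

    # eq (12): locator = 1x1 convolution parameterized by the x_a class codes
    weight = torch.stack(codes_a).view(len(codes_a), -1, 1, 1)   # (K, C, 1, 1)
    pred_heat = torch.sigmoid(F.conv2d(m_q, weight))              # (B, K, H/R, W/R)

    # eq (13): L1 heatmap loss over the query set
    loss_q = F.l1_loss(pred_heat, query_gt_heatmaps, reduction="mean")

    # contrastive term of eq (14): both views feed the MLP head (see loss sketch above)
    z = mlp_head(torch.stack(codes_a + codes_b))                  # (2K, 128)
    y = torch.tensor(labels + labels, device=z.device)
    return loss_q + supervised_contrastive_class_code_loss(z, y, tau)
```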
and 4, performing meta-testing on the trained small sample target detection model.
After the feature extractor, the class encoder and the target locator have undergone meta-training, a small number of labeled new-class samples are input into the class encoder to generate the weight parameters of the specific class, which parameterize the target locator. The feature extractor then extracts features of the test picture to output a feature map, and the target locator completes the detection of the targets in the test picture. The new-class test pictures are simply propagated forward to detect possible targets, and the model does not need to be retrained or fine-tuned.
The robust small sample detection model is obtained through the two-stage training, and one or more new classes are injected into the model for testing in a way similar to the second-stage training. First, a group of support sets of new-class samples is input into the model, and the class-specific encoder extracts the class codes of the new classes; then a new-class test picture is input into the model, and the feature extractor extracts its features to obtain a multi-channel feature map; both are input into the target locator, which performs the convolution operation according to formula (3) to obtain a heatmap, the peak points within the threshold range are extracted from the heatmap, and the target candidate boxes are obtained according to formula (8), giving the detection result for the test picture.
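A sketch of this incremental injection at meta-test time, reusing the hypothetical modules of the earlier sketches, is given below; only forward passes are performed, with no retraining or fine-tuning.

```python
# Sketch of meta-testing: register a never-seen class from a few labelled
# support images, then detect it on test images by a single forward pass.
# Functions/classes reused from the earlier sketches are illustrative only.
import torch

def register_new_class(code_generator, class_codes, name, support_imgs):
    """Add one new class without retraining: just generate and store its code."""
    with torch.no_grad():
        class_codes[name] = code_generator(support_imgs)        # (C,) class code
    return class_codes

def detect(feature_extractor, class_codes, test_image):
    """Forward pass only: shared features convolved with all stored class codes."""
    with torch.no_grad():
        feats = feature_extractor(test_image.unsqueeze(0))       # (1, C, H/R, W/R)
        return locate(feats, class_codes)                        # heatmaps; decode via eq (8)
```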
Under training with a large number of base class images containing rich labels and a small number of labeled new-class small samples (few-shot), the detection effect on new-class test pictures is improved, i.e., the mAP and AR scores are raised.
Here AP is the average precision and mAP is the mean average precision over m sample classes. Precision is defined with respect to the prediction results: it indicates how many of the samples predicted as positive are truly positive (the proportion of actual positives among all samples predicted as positive), and is used to evaluate the accuracy of prediction.
Recall (also known as TPR) is defined with respect to the original samples: it indicates how many of the positive samples are predicted correctly (the proportion of predicted positives among all actual positive samples), and is used to evaluate how completely the positive samples are recovered.
As shown in fig. 4, the area under the PR curve is the AP value.
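As a simple illustration, the AP of one class can be computed as the area under the precision–recall curve roughly as follows; the all-point interpolation used here is an assumption (the COCO protocol additionally averages over IoU thresholds), and mAP then averages this value over classes.

```python
# Sketch: AP for one class as the area under the precision-recall curve.
# Inputs are detection scores with match flags against ground truth; the
# all-point interpolation used here is an assumption of this sketch.
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """scores: detection confidences; is_true_positive: 1/0 match flags;
    num_gt: number of ground-truth boxes of this class."""
    if len(scores) == 0 or num_gt == 0:
        return 0.0
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # enforce a monotonically decreasing precision envelope, then integrate
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    recall = np.concatenate(([0.0], recall))
    return float(np.sum((recall[1:] - recall[:-1]) * precision))
```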
The experiments use the COCO2017 [14] benchmark data set commonly used for target detection, with 118,287 training images and 5,000 validation images covering 80 object classes in total, of which 20 classes serve as new classes. These 20 classes are the same as those covered by the PASCAL VOC2007 [15] data set, and the remaining 60 classes of the COCO data set serve as base classes. The experiments are therefore divided into same-dataset evaluation on COCO, and cross-dataset evaluation in which the 60 COCO classes serve as base classes and the 20 classes of the PASCAL VOC data set serve as new classes for testing.
Same-dataset evaluation on COCO:
The base-class training images of the COCO data set are used for the two stages of meta-training of the model. The base-class training images are first resized to 512×512 and supervised training is carried out according to the first-stage training procedure. In the second stage, limited by GPU memory, each episode is set to randomly sample 28 meta-tasks, each covering the detection of 3 classes with 5 labeled boxes per class; more tasks would benefit performance. For meta-testing, we use multiple randomly sampled groups of support sets containing only the 20 new-class samples from the COCO training set to extract class codes, where each group of support sets is injected into the model once and the model is updated only once, and each new class contains only {k = 1, 5, 10} labeled boxes. The performance of our small sample target detection model is evaluated using the new-class images of the COCO validation set as test images. We compare our model with several other popular small sample target detection methods: 1) a standard Fine-Tuning detection model based on Faster RCNN; 2) Model-Agnostic Meta-Learning (MAML); 3) the non-incremental Few-shot object detection via feature reweighting; 4) the incremental small sample object detection network ONCE. The experimental results are shown in Table 1. For {k = 1, 5, 10}, the test results of injecting new classes into the several trained small sample detection models show that the AP values of our method are higher than those of the other comparison methods, and the AP and AR are the best in both the base-class test and the mixed test of base and new classes. Meanwhile, in the case of {k = 10}, the comparison of the detection results of ONCE and our method is shown in FIG. 5; it is easy to see that our method makes fewer misjudgments and detects difficult objects in the image more accurately.
TABLE 1. Same-dataset detection comparison results on COCO
VOC cross-dataset evaluation:
In the two stages of meta-training, the base-class data of the COCO data set are again used for training, and the training procedure is exactly the same as in the COCO same-dataset evaluation. The difference is that, in the meta-test phase, several support sets of new classes are sampled from the training set of the COCO2017 data set; likewise, for each group of support sets the model is updated only once, and each new class contains only {k = 5, 10} labeled boxes. The performance of our small sample target detection model is then evaluated using the PASCAL VOC2007 test set images as test images. The experimental comparison results are shown in Table 2. In addition, for the test with ten labeled boxes per new class in each group of support sets, i.e. {k = 10}, a comparison of the detection effects of ONCE and our method is given in FIG. 6. From the results it can be seen that, when five or ten labeled boxes are sampled for each new class on the COCO training set and testing is performed on the PASCAL VOC test set, both the AP and AR values of the obtained detection results are better than those on the COCO data set. This demonstrates that, in the small sample target detection task, detection on images of the COCO data set is more challenging than on the PASCAL VOC data set. It can also be concluded that our framework can easily be transferred to other data sets for detection, which is of great significance for related work on small sample target detection.
TABLE 2 VOC Cross-dataset detection comparison results
The method introduces the idea of supervised contrastive learning into the small sample target detection model; connecting the contrastive learning branch improves the performance of the model's class-specific codes, thereby alleviating, to a certain extent, the insufficient generalization of the model to new classes. The training strategy of the invention does not construct a balanced small sample set of base and new classes for meta-training; instead, a small number of never-seen new-class samples are injected directly into the model at the meta-testing stage for detection. This is extremely challenging, as a newly injected detection category is easily misjudged as a base class and the generalization of the model is severely tested, but this setting is the key focus of research in the field of small sample target detection. By adopting this approach, the method can easily introduce new-class samples for effective detection, which is more conducive to future deployment in specific application scenarios. In the experimental part, the method is evaluated with the main target detection metrics AP and AR, obtaining a good detection effect and alleviating, to a certain extent, the insufficient generalization of existing detection models to new classes. Meanwhile, contrastive learning benefits from a larger number of contrastive samples and more GPU memory; with more memory, the invention could sample more meta-tasks per episode and further improve performance.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. An anchor-free incremental target detection method, characterized by comprising the following steps:
step 1, selecting a target detection model;
step 2, constructing a small sample target detection model based on the target detection model in the step 1;
step 3, performing meta-training on the small sample target detection model;
and 4, performing meta-testing on the trained small sample target detection model.
2. The anchor-free incremental target detection method as claimed in claim 1, wherein in step 1, the specific step of constructing the target detection model comprises:
step 1.1, selecting a CenterNet detection network as a target detection model of a base class network.
3. The anchor-free incremental target detection method according to claim 2, wherein in step 2, the specific steps of constructing the small sample target detection model are as follows:
step 2.1, regarding the CenterNet detection network as consisting of a feature extractor and a target locator, wherein the feature extractor adopts a ResNet residual network as the encoder and a deconvolution network as the decoder, and all new classes and base classes share weights; the target locator contains the convolution kernel weights of each individual class to be detected, and it analyses the 3D feature map output by the feature extractor using class-specific convolution kernels to generate the detection results of the input samples in the form of heatmaps;
step 2.2, introducing a class-specific code generator, wherein the class-specific code generator is provided with a class encoder with the same structure as the encoder of the feature extractor and is used for generating the convolution kernel weights $C_k$, and the generated convolution kernel weights $C_k$ are used to parameterize the target locator; the class codes generated by the class-specific code generator are trained through a contrastive learning branch so as to improve the consistency among codes of the same class and enlarge the difference between codes of different classes.
4. The method as claimed in claim 3, wherein in step 3, the step of meta-training the small sample target detection model comprises:
step 3.1, training a class feature extractor on a CenterNet detection network by using a base class data set, wherein the class feature extractor is used for feature extraction of new class data;
step 3.2, dividing the labeled base class data set into support set and query set images, inputting the query set into the feature extractor and the support set into the class-specific code generator, so that the feature extractor extracts features from the base class data set and the class-specific code generator generates class codes related to the base class data set;
and 3.3, performing combined training on the class specific code generator and the target locator, so that the target locator performs positioning learning on the new class data by combining the class codes and the extracted features.
5. The anchor-free incremental target detection method as claimed in claim 4, wherein in step 4, the specific steps of performing the meta-test on the trained small sample target detection model are as follows:
step 4.1, inputting a small amount of new class data with labels to a trained class-specific encoder to generate a weight parameter of a specific class, and parameterizing a target locator;
step 4.2, the class feature extractor extracts the features of the new class data and outputs a feature map,
and 4.3, completing the detection of the target in the test image by the parameterized target positioner.
6. The method according to claim 3, wherein in step 2.2, the specific step of performing the contrast learning branch training on the class code generated by the class-specific code generator is as follows:
2.2.1, converting the characteristics of class codes into 128-dimensional contrast characteristics by applying a layer of multilayer perceptron;
and 2.2.2, measuring the similarity between different classes of codes on the normalized class characteristics of the multilayer perceptron codes, and optimizing by using a loss function of the specific codes of the supervised contrast learning class so as to improve the intra-class similarity and the inter-class difference.
7. The method as claimed in claim 6, wherein in step 2.2.2, assuming there are two class codes $X_i$ and $X_j$ with the same label, the loss function for supervised contrastive learning of class-specific codes is:

$$L_{CCE}=\frac{1}{N}\sum_{i=1}^{N}L_{i} \quad (1)$$

$$L_{i}=\frac{-1}{N_{y_i}-1}\sum_{j=1}^{N}\mathbb{1}_{j\neq i}\,\mathbb{1}_{y_j=y_i}\log\frac{\exp(z_i\cdot z_j/\tau)}{\sum_{k=1}^{N}\mathbb{1}_{k\neq i}\exp(z_i\cdot z_k/\tau)} \quad (2)$$

in formula (1), $L_i$ is the loss function value computed for a single sample and $L_{CCE}$ is the average loss function value of one meta-task; in formula (2), $N_{y_i}$ represents the number of samples belonging to the same class in a meta-task, $\mathbb{1}_{k\neq i}$ is the indicator function, taking 0 if and only if k = i and 1 otherwise, τ is the temperature parameter being optimized, $z_i$ is the 128-dimensional normalized class feature obtained by passing class code $X_i$ through the multilayer perceptron, the numerator in the log function is the representation similarity between $X_i$ and $X_j$, and the denominator is the similarity between $X_i$ and all data in each meta-task, including positive and negative samples.
8. The method according to claim 1, wherein in step 2.2, the class-specific code generator outputs the class-dependent convolution kernel weights $C_k$ by means of global average pooling to further parameterize the weight parameters of the target locator.
CN202111153974.XA 2021-09-29 2021-09-29 Anchor-free incremental target detection method Active CN113822368B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111153974.XA CN113822368B (en) 2021-09-29 2021-09-29 Anchor-free incremental target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111153974.XA CN113822368B (en) 2021-09-29 2021-09-29 Anchor-free incremental target detection method

Publications (2)

Publication Number Publication Date
CN113822368A true CN113822368A (en) 2021-12-21
CN113822368B CN113822368B (en) 2023-06-20

Family

ID=78921753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111153974.XA Active CN113822368B (en) 2021-09-29 2021-09-29 Anchor-free incremental target detection method

Country Status (1)

Country Link
CN (1) CN113822368B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114663707A (en) * 2022-03-28 2022-06-24 中国科学院光电技术研究所 Improved few-sample target detection method based on fast RCNN
CN114676771A (en) * 2022-03-22 2022-06-28 西安交通大学 Online target detection and promotion algorithm based on self-supervision and similarity suppression
CN114898154A (en) * 2022-05-16 2022-08-12 北京有竹居网络技术有限公司 Incremental target detection method, device, equipment and medium
CN115880266A (en) * 2022-12-27 2023-03-31 深圳市大数据研究院 Intestinal polyp detection system and method based on deep learning
CN116363085A (en) * 2023-03-21 2023-06-30 江苏共知自动化科技有限公司 Industrial part target detection method based on small sample learning and virtual synthesized data
CN117011575A (en) * 2022-10-27 2023-11-07 腾讯科技(深圳)有限公司 Training method and related device for small sample target detection model

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050099330A1 (en) * 1999-05-25 2005-05-12 Safe Zone Systems, Inc. Object detection method and apparatus
CN110969205A (en) * 2019-11-29 2020-04-07 南京恩博科技有限公司 Forest smoke and fire detection method based on target detection, storage medium and equipment
CN112329827A (en) * 2020-10-26 2021-02-05 同济大学 Increment small sample target detection method based on meta-learning
CN112819110A (en) * 2021-04-19 2021-05-18 中国科学院自动化研究所 Incremental small sample target detection method and system based on weight generation
CN112861720A (en) * 2021-02-08 2021-05-28 西北工业大学 Remote sensing image small sample target detection method based on prototype convolutional neural network
WO2021154624A1 (en) * 2020-01-27 2021-08-05 Matthew Charles King System and method for performing machine vision recognition of dynamic objects
CN113221987A (en) * 2021-04-30 2021-08-06 西北工业大学 Small sample target detection method based on cross attention mechanism
CN113361645A (en) * 2021-07-03 2021-09-07 上海理想信息产业(集团)有限公司 Target detection model construction method and system based on meta-learning and knowledge memory
CN113379718A (en) * 2021-06-28 2021-09-10 北京百度网讯科技有限公司 Target detection method and device, electronic equipment and readable storage medium
CN113392855A (en) * 2021-07-12 2021-09-14 昆明理工大学 Small sample target detection method based on attention and comparative learning
CN113393457A (en) * 2021-07-14 2021-09-14 长沙理工大学 Anchor-frame-free target detection method combining residual dense block and position attention

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050099330A1 (en) * 1999-05-25 2005-05-12 Safe Zone Systems, Inc. Object detection method and apparatus
CN110969205A (en) * 2019-11-29 2020-04-07 南京恩博科技有限公司 Forest smoke and fire detection method based on target detection, storage medium and equipment
WO2021154624A1 (en) * 2020-01-27 2021-08-05 Matthew Charles King System and method for performing machine vision recognition of dynamic objects
CN112329827A (en) * 2020-10-26 2021-02-05 同济大学 Increment small sample target detection method based on meta-learning
CN112861720A (en) * 2021-02-08 2021-05-28 西北工业大学 Remote sensing image small sample target detection method based on prototype convolutional neural network
CN112819110A (en) * 2021-04-19 2021-05-18 中国科学院自动化研究所 Incremental small sample target detection method and system based on weight generation
CN113221987A (en) * 2021-04-30 2021-08-06 西北工业大学 Small sample target detection method based on cross attention mechanism
CN113379718A (en) * 2021-06-28 2021-09-10 北京百度网讯科技有限公司 Target detection method and device, electronic equipment and readable storage medium
CN113361645A (en) * 2021-07-03 2021-09-07 上海理想信息产业(集团)有限公司 Target detection model construction method and system based on meta-learning and knowledge memory
CN113392855A (en) * 2021-07-12 2021-09-14 昆明理工大学 Small sample target detection method based on attention and comparative learning
CN113393457A (en) * 2021-07-14 2021-09-14 长沙理工大学 Anchor-frame-free target detection method combining residual dense block and position attention

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JUAN-MANUEL et al.: "Incremental Few-Shot Object Detection", 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), pages 13843-13852 *
MENG CHENG et al.: "Meta-Learning-Based Incremental Few-Shot Object Detection", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, pages 2158-2169 *
张明伟; 蔡坚勇; 李科; 程玉; 曾远强: "Indoor person detection method based on DE-YOLO" (基于DE-YOLO的室内人员检测方法), Computer Systems & Applications (计算机系统应用), no. 01, pages 207-212 *
徐培 et al.: "Few-shot object detection method based on two-stage voting" (基于两阶段投票的小样本目标检测方法), Journal of Computer Applications (计算机应用), vol. 34, no. 4, pages 1126-1129 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114676771A (en) * 2022-03-22 2022-06-28 西安交通大学 Online target detection and promotion algorithm based on self-supervision and similarity suppression
CN114663707A (en) * 2022-03-28 2022-06-24 中国科学院光电技术研究所 Improved few-sample target detection method based on fast RCNN
CN114898154A (en) * 2022-05-16 2022-08-12 北京有竹居网络技术有限公司 Incremental target detection method, device, equipment and medium
CN117011575A (en) * 2022-10-27 2023-11-07 腾讯科技(深圳)有限公司 Training method and related device for small sample target detection model
CN115880266A (en) * 2022-12-27 2023-03-31 深圳市大数据研究院 Intestinal polyp detection system and method based on deep learning
CN115880266B (en) * 2022-12-27 2023-08-01 深圳市大数据研究院 Intestinal polyp detection system and method based on deep learning
CN116363085A (en) * 2023-03-21 2023-06-30 江苏共知自动化科技有限公司 Industrial part target detection method based on small sample learning and virtual synthesized data
CN116363085B (en) * 2023-03-21 2024-01-12 江苏共知自动化科技有限公司 Industrial part target detection method based on small sample learning and virtual synthesized data

Also Published As

Publication number Publication date
CN113822368B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
Liu et al. Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network
CN113822368A (en) Anchor-free incremental target detection method
CN109815364B (en) Method and system for extracting, storing and retrieving mass video features
CN106909924B (en) Remote sensing image rapid retrieval method based on depth significance
Chen et al. Pointgpt: Auto-regressively generative pre-training from point clouds
Oertel et al. Augmenting visual place recognition with structural cues
CN109443382A (en) Vision SLAM closed loop detection method based on feature extraction Yu dimensionality reduction neural network
CN110046579B (en) Deep Hash pedestrian re-identification method
CN112801068B (en) Video multi-target tracking and segmenting system and method
Wei et al. Transformer-based domain-specific representation for unsupervised domain adaptive vehicle re-identification
Wanyan et al. Active exploration of multimodal complementarity for few-shot action recognition
Zeng et al. Robust multivehicle tracking with wasserstein association metric in surveillance videos
CN111291695B (en) Training method and recognition method for recognition model of personnel illegal behaviors and computer equipment
CN114663798A (en) Single-step video content identification method based on reinforcement learning
CN115587335A (en) Training method of abnormal value detection model, abnormal value detection method and system
CN115690549A (en) Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model
Magdy et al. Violence 4D: Violence detection in surveillance using 4D convolutional neural networks
Balasubramanian et al. Traffic scenario clustering by iterative optimisation of self-supervised networks using a random forest activation pattern similarity
Osman et al. PlaceNet: A multi-scale semantic-aware model for visual loop closure detection
Yu et al. Rhythmic representations: Learning periodic patterns for scalable place recognition at a sublinear storage cost
CN112766368A (en) Data classification method, equipment and readable storage medium
Chen et al. Grid-based multi-object tracking with Siamese CNN based appearance edge and access region mechanism
Wang et al. Sture: Spatial–temporal mutual representation learning for robust data association in online multi-object tracking
Meng et al. A GPU-accelerated deep stereo-LiDAR fusion for real-time high-precision dense depth sensing
Le et al. Btel: A binary tree encoding approach for visual localization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant