CN112364747A

CN112364747A - Target detection method under limited sample

Info

Publication number: CN112364747A
Application number: CN202011219061.9A
Authority: CN
Inventors: 黄丹; 冯欣; 陈志�; 吴浩铭
Original assignee: Chongqing High Tech Zone Feima Innovation Research Institute
Current assignee: Chongqing High Tech Zone Feima Innovation Research Institute
Priority date: 2020-11-04
Filing date: 2020-11-04
Publication date: 2021-02-12
Anticipated expiration: 2040-11-04
Also published as: CN112364747B

Abstract

The invention relates to a target detection method under a limited sample, which mainly comprises the following steps that firstly, picture samples in a new class are input into a backbone network for extraction of detection target characteristics and regression of a boundary frame, a two-classification task of whether a target object exists or not is carried out on an obtained boundary candidate frame, the boundary frame which obviously does not comprise the detection target is removed, and screening is carried out according to classification scores to obtain a candidate recommendation area of the sample; and secondly, forming a full-connected graph by the obtained convolution characteristics corresponding to the candidate regions, and processing the graph structure by using a trained neural network to obtain a class label of each candidate region. The method is universal in the field of few samples and has wide potential application.

Description

Target detection method under limited sample

Technical Field

The invention relates to the field of image detection and calculation, in particular to a target detection method under a limited sample.

Background

In the past few years, the deep learning algorithm based on the convolutional neural network has achieved remarkable performance in the field of target detection, and the success of the deep learning algorithm depends on a large number of target detection data sets with complete and accurate frame annotations. In practical applications, data with a complete annotation tag may be limited for a given target detection task. When data is scarce, the convolutional neural network can be severely over-fitted and cannot be generalized, and the capacity of the detector is limited. In contrast, humans exhibit a powerful ability to do this task: children can learn to identify new categories in a few pictures. Data such as medical images, endangered animals, etc. lack examples or it is difficult to obtain complete accurate data, so the ability to learn to detect objects from a small number of samples is needed for computer vision.

Since in the real world, the target object has large differences in illumination, shape, texture, etc., it is a challenge to detect few samples. Some progress has been made in current studies on few sample learning, but these methods focus on image classification and involve few target detection problems. For low-sample detection, the core problem is how to locate the target object in a cluttered background by a small number of sample studies. In this task, our goal is to challenge the problem of low-sample object detection, as shown in fig. 1, and in particular, given some base classes that have enough samples with label annotations and new classes that have only a small amount of data with labels, the object obtains a model that can detect both the new classes and the base classes. To date, few methods are available. Recently, meta-learning provides a reliable solution to similar problems, such as the low-sample classification problem. However, the difficulty of target detection is much greater, and it not only involves classification prediction of the target, but also involves positioning of the target, so the existing few-sample classification method cannot be directly applied to the problem of few-sample detection. Taking the matching network and prototype network as an example, it is unclear how to construct the target prototype for matching and positioning since there may be dispersed objects of irrelevant class in the image, or no target object at all.

Disclosure of Invention

Aiming at the problems in the prior art, the technical problems to be solved by the invention are as follows: how to rapidly and accurately detect the target under the condition of few samples.

In order to solve the technical problems, the invention adopts the following technical scheme: a target detection method under a limited sample comprises the following steps:

s100: inputting the labeled samples in all the new classes into a trunk neural network, and extracting the characteristics of the detection target;

inputting the labeled samples in all the new classes into a regional recommendation network, wherein the regional recommendation network adopts a bounding box returning part in an SSD and adds a two-classification task with or without a target object behind the bounding box returning part, and a plurality of candidate recommendation regions are obtained for each labeled sample through the regional recommendation network;

performing secondary classification processing on all the boundary frames with the label samples to remove the boundary frames obviously not including the detection target so as to obtain all candidate regions of each label sample;

s200: constructing each candidate region with a label sample and the corresponding detection target feature obtained in the step S100 into a complete graph, wherein each node in the complete graph represents the feature corresponding to each recommended region, and each edge represents the probability that two connected nodes belong to the same class;

s300: then the complete graph obtained in the step S200 is used as the input of the convolutional neural network, the node features of the complete graph are formed by the convolutional features of the trunk network in the step S100 corresponding to the candidate regions, and the class relationship among the candidate recommendation regions is used as the edge features of the complete graph;

obtaining the predicted values of the node characteristics and the edge characteristics through multiple times of node characteristic updating and edge characteristic updating;

after N iterations, the predicted value and the true value are used for calculating loss, gradient feedback is carried out, network parameters are updated, when the calculated loss is larger than a threshold value, the iteration times are reset, the network is continuously updated until the loss is not larger than the threshold value, and a trained network model is obtained;

s400: the method comprises the steps of obtaining features of a target to be detected and all candidate areas to be detected by using an image to be detected through the S100 method, obtaining a complete image to be detected through the S200 method, inputting the complete image to be detected into a network model trained through the S300 method, outputting all node features and corresponding edge features of the image to be detected, conducting softmax on each query node and supporting nodes of each category and the edge features of each query node and all supporting nodes, obtaining the probability that the query node belongs to the category, and taking the category with the highest probability as the category corresponding to the target to be detected in the candidate recommended area to be detected.

Preferably, the area recommendation network in S100 is trained by the following method:

s110, training the regional recommendation network on a base class data set in a scene learning mode;

s120: the area recommendation network trained on the base class data set in the step S110 is trained again on a new class data set with a small amount of labeled data;

a loss function L used in the training processes of S110 and S120_totalComprises the following steps:

L_total＝L_main+λ_BDL_BD

where L2 regularization is used to penalize F_BDActivation of (2):

L_BD＝||F_BD||₂

L_main＝L_reg+L_cls

said L_BDRepresenting background suppression regularization, F_BDRepresenting a characteristic region corresponding to the background of the image, said L_regRepresents the regression loss, L, of the target bounding box_clsRepresenting a loss of binary class, λ, of a non-target object_BDRepresenting the weighting coefficients of the background suppression regularization.

Preferably, in S200, the method for processing all candidate regions and corresponding detection target features of each labeled sample into a complete graph includes: and taking the target feature of each category of the new category as a support node, taking the feature of a detection target corresponding to a candidate recommendation area obtained through a regional recommendation network as a query node, determining edge features between the nodes according to the categories between the nodes, wherein the edge feature value between the support nodes belonging to the same category is 1, and the edge feature value between the support nodes not belonging to the same category is 0.

Preferably, the training process of the S300 convolutional neural network is as follows:

s310: let G denote the complete graph, v_iAnd e_ijRespectively representing the ith node feature in the node feature set and the edge feature between the ith node and the jth node in the edge feature set, and the true value y of each edge label_ijDefined by the true value of the node feature:

wherein, y_iClass label, y, representing the ith node_jA category label representing the jth node;

each edge feature is a two-dimensional feature vector e_ij∈[0,1]²The node features are initialized by the middle-level features of the recommended region, and each edge feature is initialized by the edge label in the following way:

wherein the content of the first and second substances,

representing a similarity relationship between two nodes.

S320: the convolutional neural network is composed of L layers, forward propagation is composed of alternate edge feature update and node feature update, and the node features of L-1 layers are given

And edge characteristics

Firstly, updating node characteristics according to a field aggregation process, performing characteristic conversion on the obtained aggregated characteristics by aggregating the characteristics of other nodes and edge characteristics in proportion, and updating the node characteristics of the layer;

edge characteristics of l-1 layer

Degree coefficient as corresponding node:

wherein the content of the first and second substances,

a representation of a feature transformation network is shown,

and

respectively representing similarity relation and dissimilarity relation between the l-1 level nodes i and j,

representing the node characteristics of level l-1 node j,

a parameter representing a feature transformation network of layer l;

s330: the edge feature updating is based on the updated node features, the node similarity scores between any pair of nodes are obtained again, and each edge feature is updated by combining the previous edge feature value and the updated node similarity score;

wherein the content of the first and second substances,

to measure the network, a similarity score is calculated,

a parameter representing a metric network used to compute the similarity score,

representing the similarity scores of the ith node and the jth node at the l-1 level,

representing the dissimilarity score of the ith node and the jth node at the l-1 level,

a node characteristic at level l representing a kth node;

s340: edge prediction labels are ultimately obtained from edge features, i.e.

Each node V_iClassification can be done by simple weighted voting on the support node-related edge features of known class information added when constructing the full graph,simple weighted voting is to sum edge features of the support nodes belonging to the category and the query nodes, obtain normalized probability of the query nodes belonging to the category through softmax, select the category with the highest probability from all the categories, and obtain a final category label; the edge label prediction probability is defined as:

wherein, C_kRepresenting the kth class, T representing the classification task for a given full graph,

representing the probability that the ith node belongs to the kth class.

Preferably, the loss function in the training process of the S300 convolutional neural network is as follows:

wherein, Y_m,eRepresenting the true values corresponding to all edge labels,

indicating the predicted value of the l layer of the network under the m task of all the edge labels.

Compared with the prior art, the invention has at least the following advantages:

in the invention, a new target detector with less samples based on graph convolution is provided to solve the target detection problem under the condition of less samples. Firstly, the advantages of a traditional target detection framework SSD are fully utilized, background suppression regularization is introduced, and fine adjustment difficulty of few-sample detection is reduced. Secondly, a complete graph is constructed for the proposed candidate region, and the data of the graph structure is processed in a graph convolution mode to obtain a final detection result. And a scene learning mode is adopted on the two types of data sets, so that a few-sample learning task is simulated, and the few-sample learning capability of the model is fully improved. In the work that follows, the correctness and rationality of the proposed method will be demonstrated by more detailed, more thorough experiments.

Drawings

Fig. 1 shows target detection in the case of a small number of samples.

FIG. 2 is an overall block diagram of the method of the present invention.

Fig. 3 is a detector based on graph convolution.

Fig. 4 is a network structure of a node feature transformation network and a node similarity metric.

Detailed Description

The method of the present invention is described in further detail below with reference to the accompanying drawings.

Given a support image S with target objects and a query image Q which may contain target objects, the task is to find all target objects belonging to the support category in the query image and mark them with a tight border. If the support set contains N categories, each of which contains K instances, one such problem is referred to as N-way K-shot detection.

A few sample target detection setting is defined in which there are two types of data available for training, namely the base class and the new class. For the base class, there is a large amount of annotation data available, while the new class provides only a few labeled examples. Our goal is to learn the way to detect new objects by using knowledge in the base class, with both the base class and the new class existing.

Such a low-sample target detection setup is useful because it is well suited to practical situations where one may wish to deploy a pre-trained detector for a new class with only a few labeled examples. More specifically, large scale target detection datasets (e.g., PSACAL VOC, MSCOCO) can be used to pre-train the detection model. However, the number of target object classes is quite limited, especially compared to the huge object classes in the real world. Therefore, it is imperative to solve this problem of target detection with few samples.

Example (b): referring to fig. 2-4, a method for detecting a target under a limited sample includes the following steps:

s100: and inputting the labeled samples in all the new classes into a backbone neural network, removing a rear full-connection layer by using the backbone neural network through a classical classification network VGG16, and extracting the features of the input image, wherein the features mainly comprise contour features, texture features and color features.

Inputting the labeled samples in all new classes into a regional recommendation network, adopting a boundary box regression part in an SSD (single network multi-box detector) and then adding a two-classification task with or without a target object, and training by adopting the following training method provided by the invention to obtain a plurality of boundary box recommendation regions for each labeled sample.

The regional recommendation network is trained by adopting the following method:

and S110, training the regional recommendation network on the base class data set in a scene learning mode, wherein the training mode of the scene learning belongs to the prior art, and the training mode can simulate a learning task with few samples, so that the difficulty in fine adjustment can be reduced, and the learning capacity of the few samples can be improved.

S120: training the area recommendation network trained on the base class data set by the S110 on a new class data set with a small amount of labeled data; the training of this step is actually fine tuning.

L_total＝L_main+λ_BDL_BD

where L2 regularization is used to penalize F_BDActivation of (2):

L_BD＝||F_BD||₂

L_main＝L_reg+L_cls

said L_BDRepresenting background suppression regularization, F_BDRepresentation and imageCharacteristic region corresponding to background, L_regRepresents the regression loss, L, of the target bounding box_clsRepresenting a loss of binary class, λ, of a non-target object_BDRepresenting the weighting coefficients of the background suppression regularization.

In the training stage of the regional recommendation network base class, the loss function mainly comprises two parts, wherein one part is regression loss of a target boundary box and two classification losses of whether a target object exists or not:

L_main＝L_reg+L_cls

the regression loss of the bounding box adopts the same loss function in the SSD, the common classification loss, namely the binary cross entropy loss is adopted for the target object, and the sum of the two parts is used as the loss function in the training stage of the regional recommendation network.

In order to further enhance the detection capability of few samples in a new class, a new regularization mode is adopted, and the background is inhibited and regularized L_BDThe regularization method is adopted for training, so that the interference of complex background information on the positioning performance can be reduced. Background suppression (BD) regularization is performed by using knowledge of objects on the new class, i.e., the true bounding box in the training image. Specifically, for training images in the new class, we first generate a convolved feature cube from the middle convolutional layer of the backbone network. Then, I mask this convolution cube with the real bounding box of all objects in the image. Thus, we can identify the feature region corresponding to the image background, i.e., F_BD. To suppress background interference, we penalize F using L2 regularization_BDActivation of (2):

L_BD＝||F_BD||₂。

by means of the background suppression regularization, the model can focus more on the region corresponding to the target object while suppressing the background region, which is particularly important for few-sample learning. The total loss function of the area recommendation network new class training stage is as follows:

L_total＝L_main+λ_BDL_BD。

and performing two-classification processing on all the boundary frames of each labeled sample, wherein the two-classification processing in the invention is the prior art, and removing the boundary frames obviously not including the detection target to obtain all the candidate areas of each labeled sample.

S200: and constructing each candidate region with the label sample and the corresponding detection target feature obtained in the step S100 into a complete graph, wherein each node in the complete graph represents the feature corresponding to each recommended region, and each edge represents the probability that two connected nodes belong to the same class.

The method comprises the following steps of processing all candidate areas and corresponding detection target features of each labeled sample into a complete graph, taking the target features of each category of a new category as support nodes, taking the features of the detection target corresponding to a candidate recommendation area obtained through a regional recommendation network as query nodes, determining edge features between the nodes according to the categories, setting edge feature values between the support nodes belonging to the same category as 1, setting edge feature values between the support nodes not belonging to the same category as 0, initializing edges between the query nodes and the support nodes as 0.5.

S300: then the complete graph obtained in the step S200 is used as the input of the convolutional neural network, the node features of the complete graph are formed by the convolutional features of the trunk network in the step S100 corresponding to the candidate regions, and the class relationship among the candidate recommendation regions is used as the edge features of the complete graph; intra-cluster similarity and inter-cluster variability are directly exploited.

Obtaining the predicted values of the node characteristics and the edge characteristics through multiple times of node characteristic updating and edge characteristic updating; each updating is to update the node features and the edge features in the complete graph, so that a new complete graph can be formed, and the edge features between the query node and the support nodes in the complete graph updated each time represent the probability that the query node and the support nodes belong to the same class;

when fine tuning is performed on new data, a new regularization method is introduced in the feature extraction stage, activation of background features is inhibited, and the difficulty of fine tuning is reduced.

After N iterations, the predicted value and the true value are used for calculating loss, gradient feedback is carried out, network parameters are updated, when the calculated loss is larger than a threshold value, the iteration times are reset, the network is continuously updated until the loss is not larger than the threshold value, and a trained network model is obtained; the training process of the convolutional neural network is as follows:

wherein, y_iIndicating a class label, y, representing the ith node_jA category label representing the ith node.

Each edge feature is a two-dimensional feature vector e_ij∈[0,1]²The strength of normalized intra-class and inter-class relationships between two connected nodes is shown, so that intra-cluster similarity and inter-cluster dissimilarity can be fully utilized. The node features are initialized by the middle layer features of the recommended area, the middle layer features are convolution features output by the convolution layer at the middle position of the backbone network, and each edge feature is initialized by the edge label according to the following mode:

wherein the content of the first and second substances,

the representation represents a similarity relationship between two nodes;

And edge characteristics

Firstly, updating node characteristics according to a field aggregation process, performing characteristic conversion on the obtained aggregated characteristics by aggregating the characteristics of other nodes and edge characteristics in proportion, and updating the node characteristics of the layer; the feature conversion network is composed of a multi-layer perceptron network, and belongs to the prior art.

Edge characteristics of l-1 layer

Degree coefficient as corresponding node:

wherein the content of the first and second substances,

a representation of a feature transformation network is shown,

and

representing the node characteristics of level l-1 node j,

a parameter representing a feature transformation network of layer l;

the method not only considers intra-class aggregation but also considers inter-class aggregation, and makes full use of the dissimilarity neighbor information and the similar neighbor information provided by the target node.

wherein the content of the first and second substances,

to measure the network, a similarity score is calculated so that the node features flow into the edge features and each element of the edge features is updated separately from each normalized intra-class similarity and inter-class dissimilarity. That is, each edge feature update takes into account not only the relationship of the corresponding node pair, but also the relationship of other node pairs. We can choose to use two separate measurement networks to compute the similarity or dissimilarity of node pairs.

A parameter representing a metric network used to compute the similarity score,

representing the node characteristics at level i for the kth node.

S340: edge prediction labels are ultimately obtained from edge features, i.e.

Can be considered as two nodes V_iAnd V_jProbabilities from the same category. Each node V_iThe classification can be carried out by simply weighting voting on the edge features related to the support nodes of the known category information added when the complete graph is constructed, the simple weighting voting is to sum the edge features of the support nodes belonging to the category and the query node, then a softmax is carried out to obtain the normalized probability of the query node belonging to the category, the category with the highest probability is selected from all the categories to obtain the final category label; the edge label prediction probability is defined as:

representing the probability that the ith node belongs to the kth class.

The loss function in the training process of the convolutional neural network is as follows: in the training process of the S300 convolutional neural network, the node features and the edge features are obtained by training as parameters through a loss function represented by the following minimization formula:

wherein, Y_m,eRepresenting the true values corresponding to all edge labels,

indicating the predicted value of the l layer of the network under the m task of all the edge labels. Edge loss L_eDefined as a binary cross entropy loss. This makes it possible to obtain not only edge predictors from the last layer but also from other layers, so the total loss is the sum of all losses calculated in all layers to improve the gradient flow in the lower layers of the network.

S400: the method comprises the steps of obtaining features of a target to be detected and all candidate areas to be detected by using an image to be detected through the S100 method, obtaining a complete image to be detected through the S200 method, inputting the complete image to be detected into a network model trained through the S300, outputting all node features and corresponding edge features of the image to be detected, enabling the final edge features between nodes to represent the probability that two nodes belong to the same category, adding the edge features of each query node and the support node of each category, and then conducting softmax to obtain the probability of the category, wherein the category with the largest probability is the category corresponding to the target to be detected in the candidate recommended area to be detected. And integrating the region recommendation result in the S100 to obtain a final detection result.

In the invention, the regional recommendation proposal network adopts an SSD mode to carry out regression of the boundary frame, the multi-convolution mode can position target objects with different sizes, and under the condition of limited samples, the adopted method can effectively obtain the boundary frame for the target objects with different sizes in the scene because the data volume is less and enough target objects with various sizes are lacked. Subsequent two-classification with or without targets can further improve the accuracy of the candidate frames, and remove the candidate frames which obviously do not contain the target object, so that the positioning accuracy of the target object under the condition of limited samples can be integrally improved. The classification part of the target object detected by the universal target only aims at the convolution characteristic of the corresponding target boundary frame, and the design of the graph structure can not only utilize the convolution characteristic of the corresponding candidate frame, but also utilize the category relation between each candidate frame and the candidate frame. The graph structure edge comprises similarity relation and dissimilarity relation of two connected bounding boxes, and the mode similar to the attention mechanism not only utilizes aggregation among classes, but also simultaneously utilizes aggregation in the classes, and fully utilizes similarity neighbor information and dissimilarity neighbor information provided by the nodes. Node features may flow into edge features simultaneously when performing edge feature updates. And the available information is fully utilized under the condition of limited samples, so that the classification accuracy of the model on the target area can be greatly improved. The whole frame structure can exert better target detection capability under the condition of limited samples.

Claims

1. A method for detecting a target under a limited sample is characterized by comprising the following steps:

s100: inputting the labeled samples in all the new classes into a backbone neural network, and extracting the characteristics of the detection target;

inputting the labeled samples in all the new classes into a regional recommendation network, wherein the regional recommendation network adopts a boundary box regression part in the SSD and adds a two-classification task with or without a target object behind the boundary box regression part, and a plurality of candidate recommendation regions are obtained for each labeled sample through the regional recommendation network;

s400: the method comprises the steps of obtaining features of a target to be detected and all candidate areas to be detected by using an image to be detected through the S100 method, obtaining a complete image to be detected through the S200 method, inputting the complete image to be detected into a network model trained through the S300 method, outputting all node features and corresponding edge features of the image to be detected, conducting softmax on each query node and supporting nodes of each category and the edge features of each query node and all supporting nodes, obtaining the probability that the query node belongs to the category, and enabling the category with the highest probability to be the category corresponding to the target to be detected in the candidate recommended area to be detected.

2. The method for detecting objects under a limited sample of claim 1, wherein: the area recommendation network in S100 is trained by the following method:

L_total＝L_main+λ_BDL_BD

where L2 regularization is used to penalize F_BDActivation of (2):

L_BD＝||F_BD||₂

L_main＝L_reg+L_cls

said L_BDRepresenting background suppression regularization, F_BDRepresenting a characteristic region corresponding to the background of the image, said L_regRepresents the regression loss, L, of the target bounding box_clsRepresenting a loss of binary class, λ, of a non-target object_BDA weighting coefficient representing a background suppression regularization.

3. The method for detecting objects under limited sample according to claim 1 or 2, wherein the method for processing all candidate regions and corresponding detected object features of each labeled sample into a complete graph in S200 comprises: and taking the target feature of each category of the new category as a support node, taking the feature of a detection target corresponding to a candidate recommendation area obtained through the regional recommendation network as a query node, determining edge features between the nodes according to the categories between the nodes, wherein the edge feature value between the support nodes belonging to the same category is 1, and the edge feature value between the support nodes not belonging to the same category is 0.

4. The method for detecting the target under the limited sample according to claim 3, wherein the training process of the S300 convolutional neural network is as follows:

s310: let G denote the complete graph, v_iAnd e_ijRespectively representing the ith node feature in the node feature set and the edge feature between the ith node and the jth node in the edge feature set, and the true value y of each edge label_ijDefined by the true values of the node features:

wherein the content of the first and second substances,

representing a similarity relationship between two nodes.

And edge characteristics

edge characteristics of l-1 layer

Degree coefficient as corresponding node:

wherein the content of the first and second substances,

a representation of a feature transformation network is shown,

and

representing the node characteristics of level l-1 node j,

a parameter representing a feature transformation network of layer l;

wherein the content of the first and second substances,

to measure the network, a similarity score is calculated,

a parameter representing a metric network used to compute the similarity score,

is shown asThe similarity scores of the i nodes and the j node at the l-1 level,

a node characteristic at level l representing a kth node;

s340: edge prediction labels are ultimately obtained from edge features, i.e.

Each node V_iThe classification can be carried out by simply weighting voting on the edge features related to the support nodes with known category information added when the complete graph is constructed, the simple weighting voting is to sum the edge features of the support nodes belonging to the category and the query nodes, then a softmax is carried out to obtain the normalized probability of the query nodes belonging to the category, the category with the maximum probability is selected from all the categories to obtain the final category label; the edge label prediction probability is defined as:

representing the probability that the ith node belongs to the kth class.

5. The method for detecting the target under the finite sample as set forth in claim 4, wherein the loss function in the training process of the S300 convolutional neural network is:

wherein, Y_m,eRepresenting the true values corresponding to all edge labels,