CN113705570B - Deep learning-based few-sample target detection method - Google Patents

Deep learning-based few-sample target detection method

Info

Publication number
CN113705570B
CN113705570B CN202111012122.9A
Authority
CN
China
Prior art keywords
target
target detection
training
network model
feature
Prior art date
Legal status
Active
Application number
CN202111012122.9A
Other languages
Chinese (zh)
Other versions
CN113705570A (en)
Inventor
李峰
蒲怀建
章登勇
彭建
赵乙芳
Current Assignee
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Changsha University of Science and Technology filed Critical Changsha University of Science and Technology
Priority to CN202111012122.9A priority Critical patent/CN113705570B/en
Publication of CN113705570A publication Critical patent/CN113705570A/en
Application granted granted Critical
Publication of CN113705570B publication Critical patent/CN113705570B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a deep learning-based few-sample target detection method, which comprises the steps of: acquiring a data set for few-sample target detection; dividing the data set into a training set and a verification set, and dividing the training set into a support set and a query set; constructing a target detection network model and a target loss function of the target detection network model; training the target detection network model according to the training set and the target loss function to obtain the trained target detection network model; and verifying the trained target detection network model according to the verification set. The invention improves the accuracy and generalization of the network model for few-sample target detection; by combining deep learning with two-stage target detection based on candidate frames, the method further improves the accuracy of detection with few samples.

Description

Deep learning-based few-sample target detection method
Technical Field
The invention relates to the technical field of image processing, and in particular to a deep learning-based few-sample target detection method.
Background
Target detection in the field of computer vision is widely used in military, industrial production, intelligent monitoring and other fields. Target detection extends image classification: it both identifies the objects contained in an image and localizes their positions. Previously, owing to the limited processing speed and memory of computers, researchers generally detected targets with conventional, non-convolutional-neural-network methods; with the rapid growth of computing power and memory, however, deep learning has become viable, and target detection methods based on deep neural networks now surpass the traditional methods in both detection efficiency and accuracy.
Currently, the task of target detection relies heavily on large labeled data sets for training; in typical target detection training, each class often has thousands of images, but in practical applications the data for some objects are very scarce or difficult to acquire. When labeled data are lacking, the model generalizes very poorly, leading to very low detection accuracy or complete failure to detect.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method for detecting a target with few samples based on deep learning, which can realize high-precision detection with few samples.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
a method for detecting a few-sample target based on deep learning comprises the following steps:
acquiring a data set of few sample target detection; the data set of the few-sample target detection comprises a plurality of picture categories, and the number of pictures corresponding to each picture category is smaller than a first preset value;
dividing the data set into a training set and a verification set, and dividing the training set into a support set and a query set;
constructing an object detection network model and an object loss function of the object detection network model, wherein the object detection network model is constructed by the following steps:
extracting a feature map of the support set and a feature map of the query set through a weight sharing feature extraction network; carrying out global average pooling on the feature map of the support set to obtain a feature vector of the support set; carrying out feature fusion on the feature vector of the support set and the feature map of the query set to obtain an attention feature map; extracting a first target candidate frame in the attention feature map; extracting a second target candidate frame in the feature map of the support set according to the real label; unifying the sizes of the first target candidate frame and the second target candidate frame, and respectively acquiring a first candidate region vector of the first target candidate frame and a second candidate region vector of the second target candidate frame via reshape; and carrying out similarity measurement on the first candidate region vector and the second candidate region vector according to a Pearson distance function to obtain the most similar category;
training the target detection network model according to the training set and the target loss function to obtain the target detection network model after training is completed;
and verifying the target detection network model after training is completed according to the verification set.
According to some embodiments of the invention, the acquiring a data set for few-sample target detection comprises the steps of: selecting pictures with correct labels and marking frames larger than 0.05% of the picture size from the Microsoft COCO data set as the data set for few-sample target detection.
According to some embodiments of the invention, the dividing the training set into a support set and a query set comprises the steps of: selecting five pictures with target category areas larger than a second preset value from the training set as the support set, and taking all the pictures of the training set as the query set.
According to some embodiments of the invention, the extracting the first target candidate frame in the attention feature map comprises the steps of: extracting the first target candidate frame in the attention feature map by using a region extraction network in Faster RCNN.
According to some embodiments of the invention, the weight sharing feature extraction network comprises a Darknet53 network and a Mish activation function.
According to some embodiments of the invention, the feature fusion of the feature vector of the support set and the feature map of the query set to obtain an attention feature map includes the steps of: and carrying out feature fusion on the feature vector of the support set and the feature map of the query set through channel convolution to obtain an attention feature map.
According to some embodiments of the invention, unifying the sizes of the first target candidate frame and the second target candidate frame comprises the steps of: scaling the first target candidate frame and the second target candidate frame respectively by using ROI Align to obtain the first target candidate frame and the second target candidate frame of unified size.
According to some embodiments of the invention, the target loss function is formulated as:
L({p_b},{t_b}) = (1/N_cls)·Σ_b L_cls(p_b, p_b*) + λ·(1/N_loc)·Σ_b p_b*·L_loc(t_b, v_b)
wherein b represents the serial number of each picture, the p_b represents the predicted probability for the picture with serial number b, the p_b* represents a label whose value is 0 or 1, the t_b represents the four parameters of a prediction box, the v_b represents the label of the prediction box, the L_cls represents the classification loss function, the L_loc represents the positioning loss function, the N_cls and the N_loc respectively represent the normalization coefficients of the classification loss function and the positioning loss function, and λ represents the weight parameter between the two.
According to some embodiments of the invention, the training the target detection network model according to the training set and the target loss function to obtain the target detection network model after training, includes the steps of:
pre-training the target detection network model by using a Pascal VOC2007 data set to obtain an initial weight;
training the target detection network model by using a training set;
and minimizing the target loss function by adopting a gradient descent method, and performing layer-by-layer reverse adjustment on the initial weight in the target detection network model to obtain the target detection network model after training is completed.
According to some embodiments of the invention, the verifying the trained object detection network model according to the verification set includes the steps of:
inputting the picture of the verification set into the target detection model after training is completed, and obtaining the target category and the position coordinate of the picture of the verification set;
and comparing the target category and the position coordinate of the verification set picture with the label of the verification set, and evaluating the accuracy of a target detection model through an average accuracy index.
Compared with the prior art, the invention has the following beneficial effects:
the target detection method screens pictures in the data set, acquires the data set of target detection with few samples by selecting pictures with the number less than a first preset value, and then constructs a target detection network model based on a small number of samples. The method has the advantages that the Pearson distance function is used for measurement in the construction of the target detection network model, the correct classification of pictures is facilitated, the weight sharing characteristic extraction network can improve the accuracy and generalization capability of the target detection network model with few samples, the target detection method also combines deep learning, the double-stage target detection requiring candidate frames is used, and the accuracy of the target detection with few samples can be improved.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a flow chart of a deep learning-based few-sample target detection method according to one embodiment of the present invention;
FIG. 2 is a flow chart of a deep learning-based few-sample target detection method according to another embodiment of the present invention;
FIG. 3 is a flow chart of an area extraction network according to one embodiment of the present invention;
fig. 4 is an internal structure diagram of an area extraction network according to an embodiment of the present invention.
Detailed Description
The technical solutions of the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings; it is apparent that the described embodiments are only some, not all, embodiments of the present disclosure. All other embodiments obtained by one of ordinary skill in the art based on the embodiments of the present disclosure without inventive effort fall within the scope of the present disclosure. It should be noted that, absent conflict, the embodiments of the present disclosure and the features of the embodiments may be combined with each other. In addition, the drawings supplement the text of the specification so that each technical feature and the overall technical scheme of the present disclosure can be understood visually and intuitively, but they shall not be understood as limiting the protection scope of the present disclosure.
Today, target detection depends heavily on large labeled data sets for training; in typical target detection training there are often thousands of images per class, but in practical applications some objects have little or no data. When labeled data are lacking, the model generalizes very poorly, leading to very low detection accuracy or complete failure to detect.
The embodiment of the invention discloses a method for detecting a target with few samples based on deep learning, which can effectively solve the problems of low detection precision or incapability of detection when the target sample data to be detected is insufficient.
Referring to fig. 1 to 4, this embodiment provides a deep learning-based few-sample target detection method, with the following specific implementation steps:
step S100, acquiring a data set of few sample target detection; the data set of the few-sample target detection comprises a plurality of picture categories, and the number of pictures corresponding to each picture category is smaller than a first preset value;
specifically, a dataset of target detection with few samples is obtained, taking a Microsoft COCO dataset as an example, the Microsoft COCO dataset contains 80 categories in total, pictures in the Microsoft COCO dataset are screened and classified in multiple levels, leaf tags with the same semantics are classified into one category, semantics which do not belong to any leaf category are deleted, pictures with incorrect tags and marking frames less than or equal to 0.05% of the size of the pictures are deleted, the number of the selected pictures is less than a first preset value, and the first preset value can be set according to the number of the pictures selected according to actual needs.
Step S200, the data set is divided into a training set and a verification set, and the training set is divided into a support set and a query set.
Specifically, 20 mutually dissimilar categories are selected from the data set for few-sample target detection as the verification set, and the remaining 60 categories are used as the training set. In the training process, five pictures with target category areas larger than a second preset value are selected from the training set as the support set, and the pictures of the whole training set are taken as the query set. The pictures in the training set are sorted by the size of their target category areas and the second preset value is set according to the number of pictures to be selected; in this embodiment, the pictures are sorted from largest to smallest target category area and the first five are selected.
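A minimal sketch of the support/query split described above, assuming each picture is represented by a dictionary with a hypothetical target_area field:

```python
def split_support_query(training_pictures, k=5):
    """Sort pictures by target-category area (largest first), take the top k
    as the support set, and use the whole training set as the query set."""
    support = sorted(training_pictures, key=lambda p: p["target_area"], reverse=True)[:k]
    query = list(training_pictures)
    return support, query
```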
Step S300, constructing an object detection network model and an object loss function of the object detection network model, wherein the object detection network model is constructed by the following steps:
extracting a feature map of the support set and a feature map of the query set through a weight sharing feature extraction network; carrying out global average pooling on the feature map of the support set to obtain a feature vector of the support set; carrying out feature fusion on the feature vector of the support set and the feature map of the query set to obtain an attention feature map; extracting a first target candidate frame in the attention feature map; extracting a second target candidate frame in the feature map of the support set according to the real label; unifying the sizes of the first target candidate frame and the second target candidate frame, and respectively acquiring a first candidate region vector of the first target candidate frame and a second candidate region vector of the second target candidate frame via reshape; and carrying out similarity measurement on the first candidate region vector and the second candidate region vector according to the Pearson distance function to obtain the most similar category;
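The global-average-pooling and channel-wise fusion steps above can be sketched in NumPy as follows; this is an illustrative single-image version under assumed tensor shapes, not the patent's actual network code:

```python
import numpy as np

def attention_feature_map(support_feat, query_feat):
    """support_feat: (C, Hs, Ws) support feature map; query_feat: (C, H, W)
    query feature map. Global average pooling collapses the support map into a
    C-dimensional vector, which then reweights the query map channel by channel
    (the channel-wise fusion that yields the attention feature map)."""
    support_vec = support_feat.mean(axis=(1, 2))      # global average pooling -> (C,)
    return query_feat * support_vec[:, None, None]    # channel-wise fusion -> (C, H, W)
```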
specifically, a Darknet53 network, a Mish activation function and the like are adopted to construct a weight sharing feature extraction network, and the weight sharing feature extraction network is used for carrying out feature extraction on a support set and an inquiry set to obtain feature graphs of the support set and the inquiry set; global average pooling is carried out on the feature images of the support set to obtain feature vectors of the support set, feature fusion is carried out on the feature vectors of the support set and the feature images of the query set through channel convolution, and the attention feature images are obtained; extracting a first target candidate frame in the attention feature map by adopting a region extraction network in a Faster RCNN; acquiring a second target candidate frame from the support set feature map by adopting a real tag of the support set feature map; scaling the first target candidate frame and the second target candidate frame by using the ROI alignment to obtain a first target candidate frame and a second target candidate frame with uniform sizes, and respectively obtaining a first candidate region vector of the first target candidate frame and a second candidate region vector of the second target candidate frame through reshape after unifying the first target candidate frame and the second target candidate frame; and measuring the first candidate region vector and the second candidate region vector by adopting a Pelson distance function, calculating the similarity between the first candidate region vector and the second candidate region vector by normalizing the average value of each dimension, and classifying the candidate frames in the query set. Wherein, the Mish activation function, the region extraction network in the Faster RCNN, the ROI Align, and the reshape are well known to those skilled in the art, and the Pearson distance formula is:
D_P(c_n, s_n) = 1 − [Σ_{i=1}^{d} (c_{n,i} − c̄_n)(s_{n,i} − s̄_n)] / [√(Σ_{i=1}^{d} (c_{n,i} − c̄_n)²) · √(Σ_{i=1}^{d} (s_{n,i} − s̄_n)²)]
wherein D_P represents the Pearson distance value, c_n and s_n represent the first candidate region vector and the second candidate region vector respectively, d represents the dimension of the first candidate region vector and the second candidate region vector, and c̄_n and s̄_n represent the averages of the first and second candidate region vectors over the d dimensions respectively.
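The Pearson distance measurement can be sketched as follows, assuming the standard definition of Pearson distance as one minus the Pearson correlation of the two mean-centred vectors:

```python
import numpy as np

def pearson_distance(c1, c2):
    """Pearson distance between two candidate-region vectors: 1 minus the
    Pearson correlation coefficient. Identical vectors give 0; perfectly
    anti-correlated vectors give 2."""
    a = c1 - c1.mean()   # centre each vector on its per-dimension average
    b = c2 - c2.mean()
    corr = (a * b).sum() / (np.sqrt((a ** 2).sum()) * np.sqrt((b ** 2).sum()))
    return 1.0 - corr
```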
A target loss function of the target detection network model is also constructed, which comprises a classification loss function and a positioning loss function. The target loss function expression is:
L({p_b},{t_b}) = (1/N_cls)·Σ_b L_cls(p_b, p_b*) + λ·(1/N_loc)·Σ_b p_b*·L_loc(t_b, v_b)
where b represents the serial number of each picture, p_b represents the predicted probability for the picture with serial number b, p_b* represents the sample label (1 if the sample label is positive, 0 if it is negative), t_b represents the four parameters of the prediction box, v_b represents the label of the prediction box, L_cls represents the classification loss function, L_loc represents the positioning loss function, and N_cls and N_loc represent the normalization coefficients of the classification loss function and the positioning loss function respectively; in this embodiment N_cls is set to 256 and N_loc to 2400. λ represents the weight parameter between the two losses and is set to 10 in this embodiment.
Wherein, the classification loss function expression is:
L_cls = −Σ_{n=1}^{N} y_n · log( exp(−D_n) / Σ_{j=1}^{N} exp(−D_j) )
where N represents the number of training classes in the support set, which in this embodiment is set to 60, y_n indicates whether the real label of the target in the query set is class n, and D_n is the Pearson distance value for class n.
The positioning loss function expression is:
L_loc(t, v) = Σ_{i∈{x,y,w,h}} smooth_L1(t_i − v_i), with smooth_L1(x) = 0.5x² for |x| < 1 and |x| − 0.5 otherwise,
where t represents the predicted value and v represents the sample label.
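A numerical sketch of the combined target loss under the definitions above. Binary cross-entropy stands in for the classification term here (the embodiment's Pearson-based classification loss would replace it), and all function names are illustrative:

```python
import numpy as np

def smooth_l1(x):
    """Smooth-L1 used by the positioning loss: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def target_loss(p, p_star, t, v, n_cls=256, n_loc=2400, lam=10.0):
    """Multi-task loss: normalized classification loss plus lambda-weighted
    positioning loss, the latter counted only for positive samples (p_star == 1).
    p: predicted probabilities, p_star: 0/1 labels, t/v: (B, 4) box parameters."""
    eps = 1e-12
    cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1.0 - p + eps))
    loc = smooth_l1(t - v).sum(axis=1)   # sum over the four box parameters
    return cls.sum() / n_cls + lam * (p_star * loc).sum() / n_loc
```

With N_cls = 256, N_loc = 2400 and λ = 10 as in this embodiment, a single positive sample whose box is off by 2 in one coordinate contributes 10 · smooth_L1(2) / 2400 = 10 · 1.5 / 2400 to the loss.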
In this embodiment, adopting the Mish activation function yields better accuracy and generalization. Extracting the first target candidate frame of the attention feature map with the region extraction network filters out most background frames and frames of unmatched categories, improving the accuracy of the target detection network. Scaling the target candidate frames with ROI Align makes the target detection network more accurate.
Step S400, training the target detection network model according to the training set and the target loss function to obtain the trained target detection network model.
Specifically, the target detection network model is first pre-trained on the Pascal VOC2007 data set to obtain initial weights; the model is then trained on the training set; finally, the target loss function is minimized with a gradient descent method, the initial weights being adjusted in reverse, layer by layer, to obtain the final trained network model. Gradient descent methods are well known to those skilled in the art and are not described in detail here. Because the categories of the training set and the verification set do not overlap, no fine-tuning on the verification-set categories is needed during training; a single training run suffices, which simplifies the training process.
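The weight-adjustment step can be illustrated with a generic gradient-descent loop; this is a schematic sketch of the optimization principle, not the embodiment's actual training code:

```python
def gradient_descent(grad_fn, w0, lr=0.1, steps=100):
    """Generic gradient descent: repeatedly adjust the weight against the
    gradient of the loss, as done layer by layer when minimising the target
    loss function."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w
```

For example, minimising the toy loss (w − 3)², whose gradient is 2(w − 3), drives w from 0 toward 3.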
Step S500, verifying the trained target detection network model according to the verification set.
Specifically, in the verification process, the verification set undergoes the same operations as the training set: it is divided into a support set and a query set, which are input into the trained target detection model for prediction to obtain the target categories and position coordinates of the verification set pictures. The target categories and position coordinates of the verification set pictures are compared with the labels of the verification set, and the accuracy of the target detection network model is determined by computing the AP50. For the AP50, the target label comprises a category label and a location label, the location label being a rectangular frame; the target detection model outputs a predicted rectangular frame, and when the intersection of the predicted frame and the label frame divided by their union is greater than or equal to 0.5, the predicted frame is considered correct; otherwise, it is considered incorrect.
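The intersection-over-union test underlying the AP50 criterion can be sketched as follows, with boxes given as (x1, y1, x2, y2) corner coordinates (an assumed convention):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned rectangles; a prediction
    counts as correct for AP50 when iou >= 0.5."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)       # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```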
In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "illustrative embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the above embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit and scope of the present invention, and these equivalent modifications or substitutions are included in the scope of the present invention as defined in the appended claims.

Claims (10)

1. A deep learning-based few-sample target detection method, characterized by comprising the following steps:
acquiring a data set of few sample target detection; the data set of the few-sample target detection comprises a plurality of picture categories, and the number of pictures corresponding to each picture category is smaller than a first preset value;
dividing the data set into a training set and a verification set, and dividing the training set into a support set and a query set;
constructing an object detection network model and an object loss function of the object detection network model, wherein the object detection network model is constructed by the following steps:
extracting a feature map of the support set and a feature map of the query set through a weight sharing feature extraction network; carrying out global average pooling on the feature map of the support set to obtain a feature vector of the support set; carrying out feature fusion on the feature vector of the support set and the feature map of the query set to obtain an attention feature map; extracting a first target candidate frame in the attention feature map; extracting a second target candidate frame in the feature map of the support set according to the real label; unifying the sizes of the first target candidate frame and the second target candidate frame, and respectively acquiring a first candidate region vector of the first target candidate frame and a second candidate region vector of the second target candidate frame via reshape; and carrying out similarity measurement on the first candidate region vector and the second candidate region vector according to a Pearson distance function to obtain the most similar category;
training the target detection network model according to the training set and the target loss function to obtain the target detection network model after training is completed;
and verifying the target detection network model after training is completed according to the verification set.
2. The deep learning based few sample target detection method of claim 1, wherein the acquiring a data set of few sample target detections comprises the steps of: and selecting pictures with correct labels and mark frames larger than 0.05% of the picture size from the Microsoft COCO data set as the data set for detecting the few-sample targets.
3. The deep learning-based few-sample target detection method of claim 1, wherein the dividing the training set into a support set and a query set comprises the steps of: selecting five pictures with target category areas larger than a second preset value from the training set as the support set, and taking all the pictures of the training set as the query set.
4. The deep learning-based few-sample target detection method of claim 1, wherein the extracting the first target candidate frame in the attention feature map comprises the steps of: extracting the first target candidate frame in the attention feature map by using a region extraction network in Faster RCNN.
5. The deep learning-based few-sample target detection method of claim 1, wherein the weight sharing feature extraction network comprises a Darknet53 network and a Mish activation function.
6. The method for deep learning based small sample object detection according to claim 1, wherein the feature fusion of the feature vector of the support set and the feature map of the query set to obtain an attention feature map comprises the steps of: and carrying out feature fusion on the feature vector of the support set and the feature map of the query set through channel convolution to obtain an attention feature map.
7. The deep learning-based few-sample target detection method according to claim 1, wherein unifying the sizes of the first target candidate frame and the second target candidate frame comprises the steps of: scaling the first target candidate frame and the second target candidate frame respectively by using ROI Align to obtain the first target candidate frame and the second target candidate frame of unified size.
8. The deep learning based few sample target detection method of claim 1, wherein the formula of the target loss function is:
L({p_b},{t_b}) = (1/N_cls)·Σ_b L_cls(p_b, p_b*) + λ·(1/N_loc)·Σ_b p_b*·L_loc(t_b, v_b)
wherein b represents the serial number of each picture, the p_b represents the predicted probability for the picture with serial number b, the p_b* represents a label whose value is 0 or 1, the t_b represents the four parameters of a prediction box, the v_b represents the label of the prediction box, the L_cls represents the classification loss function, the L_loc represents the positioning loss function, the N_cls and the N_loc respectively represent the normalization coefficients of the classification loss function and the positioning loss function, and λ represents the weight parameter between the two.
9. The method for deep learning based small sample target detection according to claim 1, wherein the training the target detection network model according to the training set and the target loss function to obtain the target detection network model after training is completed, comprises the steps of:
pre-training the target detection network model by using a Pascal VOC2007 data set to obtain an initial weight;
training the target detection network model by using a training set;
and minimizing the target loss function by adopting a gradient descent method, and performing layer-by-layer reverse adjustment on the initial weight in the target detection network model to obtain the target detection network model after training is completed.
10. The deep learning-based few-sample target detection method according to claim 1, wherein verifying the trained target detection network model according to the verification set comprises the steps of:
inputting the pictures of the verification set into the trained target detection model to obtain the target categories and position coordinates of the verification-set pictures;
and comparing the target categories and position coordinates of the verification-set pictures with the labels of the verification set, and evaluating the accuracy of the target detection model with the mean average precision (mAP) metric.
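The average-precision evaluation in claim 10 ranks detections by confidence and averages the precision at each correctly detected target; the mean over classes gives mAP. A minimal sketch (illustrative; a real mAP evaluation additionally matches predicted boxes to ground-truth boxes by IoU before marking a detection correct):

```python
import numpy as np

def average_precision(scores, correct):
    """AP for one class: sort detections by confidence, then average the
    precision observed at each correct detection."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    correct = np.asarray(correct, dtype=float)[order]
    tp = np.cumsum(correct)                         # true positives so far
    precision = tp / (np.arange(len(correct)) + 1)  # precision at each rank
    return precision[correct == 1].sum() / correct.sum()

def mean_average_precision(per_class):
    """mAP: mean of per-class APs over (scores, correct) pairs."""
    return sum(average_precision(s, c) for s, c in per_class) / len(per_class)

ap = average_precision([0.95, 0.8, 0.6, 0.5], [1, 1, 0, 1])
# precision at the three hits: 1/1, 2/2, 3/4 -> AP = (1 + 1 + 0.75) / 3
```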
CN202111012122.9A 2021-08-31 2021-08-31 Deep learning-based few-sample target detection method Active CN113705570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111012122.9A CN113705570B (en) 2021-08-31 2021-08-31 Deep learning-based few-sample target detection method


Publications (2)

Publication Number Publication Date
CN113705570A CN113705570A (en) 2021-11-26
CN113705570B true CN113705570B (en) 2023-12-08

Family

ID=78657955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111012122.9A Active CN113705570B (en) 2021-08-31 2021-08-31 Deep learning-based few-sample target detection method

Country Status (1)

Country Link
CN (1) CN113705570B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114743257A (en) * 2022-01-23 2022-07-12 The 10th Research Institute of China Electronics Technology Group Corporation Method for detecting and identifying image target behaviors
CN114663707A (en) * 2022-03-28 2022-06-24 Institute of Optics and Electronics, Chinese Academy of Sciences Improved few-sample target detection method based on Faster RCNN
CN114818963B (en) * 2022-05-10 2023-05-09 University of Electronic Science and Technology of China Small-sample detection method based on cross-image feature fusion
CN116129226B (en) * 2023-04-10 2023-07-25 Zhejiang Lab Method and device for detecting few-sample targets based on multi-prototype mixing module

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800811A (en) * 2019-01-24 2019-05-24 Jilin University Small-sample image recognition method based on deep learning
CN109961089A (en) * 2019-02-26 2019-07-02 Sun Yat-sen University Small-sample and zero-sample image classification method based on metric learning and meta-learning
CN110363151A (en) * 2019-07-16 2019-10-22 Naval Aviation University of the PLA Radar target detection method with controllable false alarm based on dual-channel convolutional neural networks
CN110569886A (en) * 2019-08-20 2019-12-13 Tianjin University Image classification method based on bidirectional channel attention meta-learning
CN110598693A (en) * 2019-08-12 2019-12-20 Zhejiang University of Technology Ship plate identification method based on Faster-RCNN
CN112488098A (en) * 2020-11-16 2021-03-12 Zhejiang Xinzailing Technology Co., Ltd. Training method of target detection model
CN112861720A (en) * 2021-02-08 2021-05-28 Northwestern Polytechnical University Remote sensing image small-sample target detection method based on prototype convolutional neural network
CN113052185A (en) * 2021-03-12 2021-06-29 University of Electronic Science and Technology of China Small-sample target detection method based on Faster R-CNN
CN113095575A (en) * 2021-04-16 2021-07-09 Tsinghua Shenzhen International Graduate School Traffic flow prediction method and system based on transfer learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569795B (en) * 2018-03-13 2022-10-14 Tencent Technology (Shenzhen) Co., Ltd. Image identification method and device and related equipment
CN110472483B (en) * 2019-07-02 2022-11-15 Wuyi University SAR image-oriented small-sample semantic feature enhancement method and device


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Feature reconstruction and metric based network for few-shot object detection; Li, YW et al.; Computer Vision and Image Understanding, Issue 227, pp. 1-11 *
Few-Shot Object Detection Based on the Transformer and High-Resolution Network; Zhang, DY et al.; CMC-Computers Materials & Continua, Vol. 74, Issue 2, pp. 3439-3454 *
Detection and recognition of maritime ships based on the R-CNN algorithm; Zhang Xin; Guo Fuliang; Liang Yingjie; Chen Xiuliang; Application Research of Computers, Issue S1, pp. 324-325, 329 *


Similar Documents

Publication Publication Date Title
CN113705570B (en) Deep learning-based few-sample target detection method
CN108830285B (en) Target detection method for reinforcement learning based on fast-RCNN
CN107133569B (en) Monitoring video multi-granularity labeling method based on generalized multi-label learning
CN108537286B (en) Complex target accurate identification method based on key area detection
CN114332621B (en) Disease and pest identification method and system based on multi-model feature fusion
CN111274926B (en) Image data screening method, device, computer equipment and storage medium
US20150379377A1 (en) Acceleration of Linear Classifiers
CN111178451A (en) License plate detection method based on YOLOv3 network
CN112016605A (en) Target detection method based on corner alignment and boundary matching of bounding box
CN111339975A (en) Target detection, identification and tracking method based on central scale prediction and twin neural network
CN109977253B (en) Semantic and content-based rapid image retrieval method and device
Yang et al. Detecting interchanges in road networks using a graph convolutional network approach
CN114241226A (en) Three-dimensional point cloud semantic segmentation method based on multi-neighborhood characteristics of hybrid model
CN115424237A (en) Forward vehicle identification and distance detection method based on deep learning
CN111444816A (en) Multi-scale dense pedestrian detection method based on fast RCNN
Liang et al. Butterfly detection and classification based on integrated YOLO algorithm
CN112418207B (en) Weak supervision character detection method based on self-attention distillation
CN111414951B (en) Fine classification method and device for images
CN108960005B (en) Method and system for establishing and displaying object visual label in intelligent visual Internet of things
Peeples et al. Comparison of possibilistic fuzzy local information c-means and possibilistic k-nearest neighbors for synthetic aperture sonar image segmentation
CN105844299B (en) A kind of image classification method based on bag of words
CN117152484A (en) Small target cloth flaw detection method for improving YOLOv5s
Wu et al. Classification of quickbird image with maximal mutual information feature selection and support vector machine
Zhang et al. A YOLOv3-Based Industrial Instrument Classification and Reading Recognition Method
Daudt et al. Learning to understand earth observation images with weak and unreliable ground truth

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant