WO2024032010A1

WO2024032010A1 - Transfer learning strategy-based real-time few-shot object detection method

Info

Publication number: WO2024032010A1
Application number: PCT/CN2023/086781
Authority: WO
Inventors: 李国权; 夏瑞阳; 林金朝; 庞宇; 朱宏钰
Original assignee: 重庆邮电大学
Priority date: 2022-08-11
Filing date: 2023-04-07
Publication date: 2024-02-15
Also published as: CN115393634A; CN115393634B

Abstract

The present invention relates to the field of image processing, and relates to a transfer learning strategy-based real-time few-shot object detection method, comprising the following steps: S1: constructing a detection network model; S2: preprocessing input data; S3: training an object detection model from scratch by using large-sample class data; S4: fine-tuning a few-shot class detection branch by using few-shot class data; and during the fine-tuning, using a new regularization method to guide the model to pay attention to a global feature of an object; and S5: training the detection model by means of a training set, and carrying out a test by using a test set. The present invention avoids overfitting of a model in a fine-tuning stage, avoids dominance by local salient features, and enhances the generalization capability of the model. The present invention not only can achieve accurate detection on few-shot class objects using fewer model parameters, but also can achieve real-time detection of related objects.

Description

A real-time detection method of few-sample targets based on transfer learning strategy

Technical field

The invention belongs to the field of image processing and relates to a real-time detection method of few-sample targets based on a transfer learning strategy.

Background art

Object detection is one of the most important and fundamental tasks in computer vision. There are many detectors based on Convolutional Neural Network (CNN) or visual Transformer with high detection performance. However, the excellent detection performance of these models is achieved at the expense of large amounts of data. Due to the complexity of the object and the large number of model parameters, the detection accuracy will drop rapidly when the amount of data is limited. Therefore, few-shot target detection has received more and more attention in recent years.

To better handle scenarios in which the number of samples is limited, several few-sample target detection models based on meta-learning strategies and transfer learning strategies have been proposed. Methods based on meta-learning aim to capture the correlation between the current image and the few samples. Although they improve detection performance on few-sample categories, their computational complexity increases greatly because of the feature extraction structure in the few-sample detection branch, the structure that relates the input features to the few-sample features, and the number of few-sample categories. Methods based on transfer learning aim to adapt a detection model that already has feature representation capability to few-sample targets. However, to improve detection accuracy, most of these methods rely on two-stage detection models such as Faster RCNN or Cascade RCNN. Because the images input to these models are large and proposal boxes must be generated by the Region Proposal Network (RPN), this type of detection model is time-consuming in the inference phase.

Summary of the invention

In view of this, the purpose of the present invention is to provide a dual-path real-time target detection model based on a transfer learning strategy. Darknet-53 combined with Spatial Pyramid Pooling (SPP) is used as the backbone and a Feature Pyramid Network (FPN) is used as the neck; they respectively extract image features and provide semantic features at different scales. For the detection head, a dual-path detection branch with a discriminator is proposed. The large-sample category detection branch is used only to detect objects of large-sample categories, while the few-sample category detection branch is used to detect objects of all categories. After the two branches output their detection results in parallel, the discriminator scans both results and outputs the more appropriate one according to a measurement criterion. The main reason for using the dual-path structure is that, when the model is trained on a small number of samples, the detection accuracy for objects of large-sample categories degrades, and the few-sample detection branch produces false positive bounding boxes that actually belong to large-sample categories. In addition, the few-sample detection branch learns the prediction differences on large-sample categories from the large-sample detection branch through knowledge distillation, which improves the generalization ability of the detection branch. Finally, to avoid overfitting of the model in the fine-tuning stage, the present invention proposes a feature-response-based Attentive DropBlock regularization method, which guides the model to focus on the overall characteristics of the target, avoids domination by local salient features, and enhances the generalization ability of the model.

In order to achieve the above objects, the present invention provides the following technical solutions:

A real-time detection method of few-sample targets based on transfer learning strategy, including the following steps:

S1: Build a detection network model;

S2: Preprocess the input data;

S3: Train the target detection model from scratch on large sample category data;

S4: Fine-tune the few-sample category detection branch on the few-sample category data; use a new regularization method to guide the model to focus on the overall characteristics of the object during fine-tuning;

S5: Train the detection model through the training set, and then test it on the test set.

Further, the detection network model includes: a backbone network, which is Darknet-53 combined with Spatial Pyramid Pooling (SPP) and is used to extract image features; a detection neck network, which consists of a Feature Pyramid Network (FPN) and is used to provide semantic features of different scales to the detection head network; and a detection head network, which is a dual-path detection branch structure with a discriminator, in which the large-sample category detection branch is used only to detect targets of the large-sample categories, the few-sample category detection branch is used to detect targets of all categories, and the discriminator scans the results of the two branches in sequence and obtains the final output according to a measurement criterion.

Further, the preprocessing described in step S2 is specifically: processing limited data by using random affine transformation, multi-scale image training strategy, MixUp data fusion strategy and Label Smoothing label processing strategy.

Further, in step S3, the backbone network is initialized with the weights trained on the ImageNet data set, and the network model except the few-sample detection branch is trained from scratch using large-sample category data. The loss function at this stage involves the prediction box coordinates, target confidence and classification results, and is:
$$L_{\text{base training}} = L_{box} + L_{cls} + L_{obj} \tag{1}$$

where $L_{box}$ is the sum of the GIoU loss and the smooth L1 loss for coordinate regression, and $L_{cls}$ and $L_{obj}$ are the Focal Loss function and the binary cross-entropy loss function, respectively.
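As an illustration of how the box term and the three-term sum in equation (1) could be computed, the following is a minimal pure-Python sketch for a single predicted box. The function names, the (x1, y1, x2, y2) box format, the smooth L1 threshold `beta=1.0` and the mean reduction are assumptions made for this example and are not specified in the text; boxes are assumed to have positive area.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 distance over the coordinates (beta is an assumed value)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def box_loss(pred, target):
    """L_box as the sum of the GIoU loss (1 - GIoU) and the smooth L1 loss."""
    return (1.0 - giou(pred, target)) + smooth_l1(pred, target)


def base_training_loss(l_box, l_cls, l_obj):
    """Equation (1): unweighted sum of the three loss terms."""
    return l_box + l_cls + l_obj
```

The classification and objectness terms (Focal Loss and binary cross-entropy) are passed in as already computed numbers here to keep the sketch short.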

Further, in step S4, the model parameters of the backbone of the detection model, the detection neck and the large-sample category detection branch are frozen, and only the few-sample category detection branch is fine-tuned. The loss function at this stage involves the prediction box coordinates, target confidence, classification results and the difference from the large-sample category detection branch.

Further, step S4 specifically includes the following steps:

S41: Establish a base class distillation loss $L_b$ between the large-sample category detection branch and the few-sample detection branch. The calculation formula is as follows:

where N represents the batch size, l represents the absolute error function, and its two arguments are the outputs for the i-th image of the large-sample detection branch and the few-sample category detection branch, respectively;

S42: The loss function for fine-tuning on few samples is:
$$L_{\text{few-shot tuning}} = L_{box} + 2L_{cls} + L_{obj} + \lambda \cdot L_{b} \tag{3}$$

where λ controls the influence of the base class distillation loss on the gradient update of the model;

S43: Add a discriminator after the large-sample category detection branch and the few-sample detection branch. The discriminator takes the maximum of the large-sample category detection branch result and the few-sample category detection branch result as the final output, and its measurement criterion is as follows:

where $O_d(i, j)$ represents the discriminator output at a specific spatial grid cell.
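The selection performed by the discriminator can be sketched as a per-cell comparison of the two branch outputs. In the minimal example below, each branch output is assumed to be an H x W grid of `(score, payload)` tuples, the score is assumed to be the quantity that the measurement criterion compares, and ties are assumed to go to the large-sample branch; none of these details is given in the text, and `discriminate` is a hypothetical name.

```python
def discriminate(base_out, few_out):
    """Per grid cell (i, j), keep the prediction with the higher score.

    base_out, few_out: H x W nested lists of (score, payload) tuples from the
    large-sample branch and the few-sample branch (assumed layout).
    """
    merged = []
    for base_row, few_row in zip(base_out, few_out):
        merged.append([
            b if b[0] >= f[0] else f  # assumption: ties go to the large-sample branch
            for b, f in zip(base_row, few_row)
        ])
    return merged


# Usage: one row with two grid cells.
base = [[(0.9, "dog"), (0.2, "car")]]
few = [[(0.4, "dog"), (0.7, "bird")]]
result = discriminate(base, few)
```

Because each cell needs only one comparison, the two branch outputs are merged in a single pass before any further post-processing.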

Further, the new regularization method is the Attentive DropBlock algorithm, which has a dynamic coefficient γ, as shown below:

where the parameters keep_prob and block_size affect how often and over how large a range the feature map is set to zero, σ represents the sigmoid function, which is used to control the response range, and α represents the response amplification factor.

Further, the Attentive DropBlock algorithm first determines whether the model is currently in the fine-tuning stage. If so, it obtains the channel response $f_C$ and the spatial response $f_S$ of the few-sample category detection branch; then it calculates the parameter γ from the parameters keep_prob, block_size and α; next, spatial positions of each channel feature are set to zero according to a Bernoulli distribution with parameter γ; finally, with each zeroed position as the center, a mask block whose length and width are block_size is constructed, thereby regularizing the model.

Further, in step S5, train and test on the PASCAL VOC and MS COCO data sets;

For the PASCAL VOC data set, the training set and the validation set are first merged into one set for training the detection model, and then its test set is used for testing. Detection accuracy and speed are represented by the mean Average Precision (mAP) at an Intersection over Union (IoU) threshold of 0.5 (i.e. mAP@50) and the mean Frames Per Second (mFPS) over multiple different few-sample sets;

For the MS COCO data set, only its training set is used for training, and its validation set is used for evaluation. Detection accuracy and speed are represented by the mAP averaged over IoU thresholds from 0.5 to 0.95 with an interval of 0.05 (i.e. AP) and Frames Per Second (FPS).
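To make the MS COCO accuracy metric concrete, the sketch below lists the ten IoU thresholds from 0.5 to 0.95 and averages per-threshold AP values over them. The mapping `ap_at_threshold` is a hypothetical input; how AP is computed at each individual threshold is not shown and is not described in the text.

```python
def coco_iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 (ten values)."""
    return [round(0.5 + 0.05 * k, 2) for k in range(10)]


def coco_ap(ap_at_threshold):
    """Average AP over the ten thresholds.

    ap_at_threshold: dict mapping each threshold to an AP value (assumed input).
    """
    thresholds = coco_iou_thresholds()
    return sum(ap_at_threshold[t] for t in thresholds) / len(thresholds)
```

The PASCAL VOC metric mAP@50 corresponds to using only the first threshold, 0.5.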

Furthermore, during the training process of step S5, stochastic gradient descent is used as the optimization method of the network model, the initial learning rate is $1\times10^{-3}$, and the mini-batch size is set to 16 on the different data sets. For the PASCAL VOC and MS COCO data sets, the number of epochs for initial training and for fine-tuning of the detection model is 300, and the CosineLR learning rate schedule (from 0.001 to 0.00001) is used during training. During prediction, the length and width of the input image are fixed at 448×448. FPS is computed from the sum of the time spent waiting for each result and the time spent post-processing the results, and mFPS is the average FPS over the different few-sample sets.
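The learning rate schedule can be sketched as a standard cosine decay from 0.001 to 0.00001. This example assumes that the 300 training rounds are epochs, that the rate is updated once per epoch, and that there is no warm-up phase; the exact CosineLR variant used is not stated in the text.

```python
import math


def cosine_lr(epoch, total_epochs=300, lr_max=1e-3, lr_min=1e-5):
    """Cosine decay from lr_max at epoch 0 to lr_min at the last epoch."""
    progress = epoch / (total_epochs - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Usage: learning rate for every epoch of one training stage.
schedule = [cosine_lr(e) for e in range(300)]
```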

The beneficial effects of the present invention are as follows: the present invention proposes a feature-response-based Attentive DropBlock regularization method that guides the model to pay attention to the overall characteristics of the object, avoids overfitting of the model in the fine-tuning stage, avoids domination by local salient features, and enhances the generalization ability of the model. The present invention can not only achieve accurate detection of few-sample category objects with fewer model parameters, but also achieve real-time detection of the related targets.

Other advantages, objects and features of the present invention will be set forth to some extent in the description that follows, and to some extent will become apparent to those skilled in the art upon examination of the following, or may be learned from the practice of the present invention. The objects and other advantages of the invention may be realized and obtained by the following description.

Brief description of the drawings

In order to make the purpose, technical solutions and advantages of the present invention clearer, the present invention will be described in detail below in conjunction with the accompanying drawings, in which:

Figure 1 is an overall flow chart of the model proposed by the present invention;

Figure 2 is a visual comparison chart of DropBlock algorithm and Attentive DropBlock algorithm;

Figure 3 is a diagram showing the visual detection results of large sample and small sample category objects by the model proposed by the present invention;

Figure 4 shows the response to the target and the visual detection results of the large-sample category detection branch and the few-sample category detection branch of the model proposed by the present invention.

Detailed description

The following describes the embodiments of the present invention through specific examples. Those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments. Various details in this specification can also be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that the illustrations provided in the following embodiments only illustrate the basic concept of the present invention in a schematic manner. The following embodiments and the features in the embodiments can be combined with each other as long as there is no conflict.

The drawings are for illustrative purposes only and are schematic diagrams rather than actual drawings, so they cannot be understood as limitations of the present invention. To better illustrate the embodiments of the present invention, some components in the drawings may be omitted, enlarged or reduced, which does not represent the size of the actual product; those skilled in the art will understand that some well-known structures and their descriptions may be omitted in the drawings.

In the drawings of the embodiments of the present invention, the same or similar numbers correspond to the same or similar components. In the description of the present invention, it should be understood that if terms such as "upper", "lower", "left", "right", "front" and "rear" indicate an orientation or positional relationship, this is based on the orientation or positional relationship shown in the drawings, is only for the convenience of describing the present invention and simplifying the description, and does not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation. Therefore, the terms describing positional relationships in the drawings are for illustrative purposes only and cannot be understood as limitations of the present invention. Those of ordinary skill in the art can understand the specific meaning of the above terms according to the specific circumstances.

Referring to Figures 1 to 4, a real-time detection method of few-sample targets based on a transfer learning strategy includes the following steps:

S1: Preprocess the input data;

S2: Train the target detection model (except the few-sample detection branch) from scratch on large-sample category data;

S3: Fine-tune the few-shot category detection branch on the few-shot category data;

S4: Introduce a new regularization method in the fine-tuning stage to guide the model to focus on the overall characteristics of the object;

S5: Conduct experiments on the natural data set PASCAL VOC 2007 and MS COCO 2014 data set;

Optionally, the S1 specifically includes the following steps:

The limited data are processed using random affine transformation, a multi-scale image training strategy (320, 352, 384, 416, 448, 480, 512, 544, 576 and 608), the MixUp data fusion strategy and the Label Smoothing label processing strategy, thereby improving the generalization performance of the detection model on the samples.
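Three of the preprocessing strategies named above can be sketched in a few lines of pure Python. The scale list comes from the text; the smoothing factor `eps=0.1`, the way the MixUp coefficient `lam` is supplied, and all function names are assumptions for illustration only. Random affine transformation is omitted.

```python
import random

# Training resolutions listed in the text for multi-scale training.
SCALES = [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]


def pick_scale(rng):
    """Choose one training resolution at random (assumed uniform sampling)."""
    return rng.choice(SCALES)


def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: move eps of the probability mass to a uniform prior."""
    k = len(one_hot)
    return [(1.0 - eps) * y + eps / k for y in one_hot]


def mixup(x_a, x_b, lam):
    """MixUp: convex combination of two flattened inputs with weight lam."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]


# Usage with a fixed seed so the scale choice is reproducible.
rng = random.Random(0)
size = pick_scale(rng)
soft = smooth_labels([0.0, 1.0, 0.0, 0.0])
mixed = mixup([1.0, 1.0], [0.0, 0.0], lam=0.3)
```

In detection settings MixUp is usually applied to whole images while the boxes of both images are kept; that bookkeeping is left out of the sketch.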

Optionally, in S2, to give the model strong target representation ability, the entire network except the few-sample detection branch is trained from scratch using large-sample category data. The loss function for training the network in this first stage is therefore:
$$L_{\text{base training}} = L_{box} + L_{cls} + L_{obj} \tag{1}$$

where $L_{box}$ is the sum of the GIoU loss and the smooth L1 loss for coordinate regression, and $L_{cls}$ and $L_{obj}$ are the Focal Loss function and the binary cross-entropy loss function, respectively.

Optionally, in S3, during the few-sample fine-tuning phase, the backbone, the detection neck and the large-sample detection branch are frozen to preserve their strong generalization ability, and only the few-sample detection branch, the SPP layer and its adjacent convolutional layers are trained. However, when only objects of the new classes are used, many false positive bounding boxes are generated because of the similarity between objects of the two kinds of classes, which results in low detection accuracy. Therefore, we randomly sample K instances from the corresponding data for each large-sample category, so that the few-sample detection branch predicts objects of all categories. In addition, since the large-sample category detection branch has strong generalization ability, the few-sample detection branch should learn from that branch to improve its own generalization ability. Therefore, we establish the base class distillation loss $L_b$ between the two branches, and the calculation formula is as follows:

where N represents the batch size, l is the sum of absolute error functions, and its two arguments are the outputs for the i-th image of the large-sample detection branch and the few-sample detection branch, respectively. Therefore, the loss function for fine-tuning on few samples can be summarized as:
$$L_{\text{few-shot tuning}} = L_{box} + 2L_{cls} + L_{obj} + \lambda \cdot L_{b} \tag{3}$$

where λ controls the influence of the base class distillation loss on the gradient update of the model.
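The distillation term and the weighted sum in equation (3) can be sketched as follows. The text states only that N is the batch size and that l is a sum of absolute errors between the two branch outputs for the i-th image; averaging over the batch (the factor 1/N), the flattening of each output into a list of numbers of equal length, and the function names are assumptions for this example.

```python
def base_distillation_loss(base_outputs, few_outputs):
    """Batch mean of the per-image summed absolute error between the branches.

    base_outputs, few_outputs: lists of length N; element i is the flattened
    output for image i (assumed to be restricted to the shared base classes).
    """
    n = len(base_outputs)
    total = 0.0
    for base_i, few_i in zip(base_outputs, few_outputs):
        total += sum(abs(b - f) for b, f in zip(base_i, few_i))
    return total / n  # assumption: 1/N normalization over the batch


def few_shot_tuning_loss(l_box, l_cls, l_obj, l_b, lam):
    """Equation (3): the classification term is doubled, distillation is scaled by lam."""
    return l_box + 2.0 * l_cls + l_obj + lam * l_b


# Usage with a batch of two images and two output values per image.
l_b = base_distillation_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [5.0, 4.0]])
loss = few_shot_tuning_loss(1.0, 1.0, 1.0, l_b, lam=0.5)
```

Since the large-sample branch is frozen during fine-tuning, its outputs act as fixed targets and only the few-sample branch receives gradients from this term.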

During the inference phase, the two parallel branches are used to jointly detect objects. However, analyzing the outputs of both branches at the same time would seriously lengthen the inference process. Therefore, we add a discriminator after the two branches to choose the more likely of the two outputs. Specifically, the discriminator takes the maximum of the large-sample category detection branch result and the few-sample category detection branch result as the final output. Its measurement criterion is as follows:

Optionally, in S4, to further improve the model's generalization ability on few-sample categories, the present invention proposes the Attentive DropBlock algorithm. This algorithm is affected not only by the parameters keep_prob and block_size but also by the response of the model's semantic features. Specifically, the original DropBlock algorithm sets a constant coefficient for all locations within the feature map, as follows:
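For reference, the constant coefficient of the original DropBlock method is commonly written in the DropBlock literature as shown below. This form is taken from that literature rather than recovered from this document, and the symbol feat_size (the spatial side length of the feature map) does not appear in the text here.

```latex
\gamma = \frac{1 - keep\_prob}{block\_size^{2}} \cdot \frac{feat\_size^{2}}{\left(feat\_size - block\_size + 1\right)^{2}}
```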

Here, the parameters keep_prob and block_size affect the frequency and range of feature zeroing. Unlike in the original DropBlock, γ in the Attentive DropBlock algorithm is a dynamic coefficient that depends on the response of the extracted feature map. Specifically, for a feature map $F \in \mathbb{R}^{B \times C \times H \times W}$, a global max pooling function is applied to each channel feature to obtain the channel response $f_C \in \mathbb{R}^{B \times C \times 1 \times 1}$, and a global average pooling function is applied to obtain the spatial response $f_S \in \mathbb{R}^{B \times 1 \times H \times W}$. The calculation formula of γ in the Attentive DropBlock algorithm is therefore as follows:

where σ represents the sigmoid function used to control the response range, and α represents the response amplification factor.

The Attentive DropBlock algorithm first determines whether the model is currently in the fine-tuning stage. If so, it obtains the channel response $f_C$ and the spatial response $f_S$ of the few-sample category detection branch. Then, after the parameter γ is calculated from the two responses, keep_prob, block_size and α, spatial positions of each channel feature are set to zero according to a Bernoulli distribution with parameter γ. Finally, with each zeroed position as the center, a mask block whose length and width are block_size is constructed to regularize the model.
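The procedure above can be sketched for a single sample stored as a C x H x W nested list. The formula that combines the responses into γ is not available in the text, so the sketch assumes that the constant DropBlock coefficient is scaled by σ(α · f_C · f_S); that combination, the square feature map, the clipping of blocks at the border and the name `attentive_dropblock` are all assumptions, not the method as claimed.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attentive_dropblock(feat, keep_prob, block_size, alpha, rng, fine_tuning=True):
    """Zero out response-dependent blocks in one C x H x W feature map."""
    if not fine_tuning:
        return feat  # the regularizer is only active during fine-tuning
    channels, height, width = len(feat), len(feat[0]), len(feat[0][0])
    # Channel response f_C: global max pooling over each channel.
    f_c = [max(max(row) for row in ch) for ch in feat]
    # Spatial response f_S: average over the channels at each location.
    f_s = [[sum(feat[c][i][j] for c in range(channels)) / channels
            for j in range(width)] for i in range(height)]
    # Constant DropBlock coefficient (assumes a square map, feat_size = height).
    base = ((1.0 - keep_prob) / block_size ** 2
            * height ** 2 / (height - block_size + 1) ** 2)
    half = block_size // 2
    out = [[row[:] for row in ch] for ch in feat]
    for c in range(channels):
        for i in range(height):
            for j in range(width):
                # ASSUMPTION: gamma scaled by the sigmoid of the amplified responses.
                gamma = base * sigmoid(alpha * f_c[c] * f_s[i][j])
                if rng.random() < gamma:  # Bernoulli draw with parameter gamma
                    top, left = i - half, j - half
                    for u in range(max(0, top), min(height, top + block_size)):
                        for v in range(max(0, left), min(width, left + block_size)):
                            out[c][u][v] = 0.0
    return out
```

Under this assumed form, locations with stronger responses receive a larger γ and are dropped more often, which matches the stated intent of steering the model away from locally dominant features. The rescaling of the surviving activations used by the original DropBlock is not mentioned in the text and is therefore left out.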

Figure 2 shows the difference between DropBlock and Attentive DropBlock. It can be observed that the γ value in Attentive DropBlock is related to the target response. Feature maps that contain more target responses have higher γ values, which means that the detection model can better avoid being dominated by locally obvious features and pay more attention to less obvious features during training, thereby achieving better few-sample target detection accuracy.

Optionally, in S5, for the PASCAL VOC data set, three different data splits are constructed, each with 15 categories as large-sample categories and the remaining 5 categories as few-sample categories (the first few-sample split includes bird, bus, cow, motorcycle and sofa; the second includes airplane, bottle, cow, horse and sofa; the third includes boat, cat, motorcycle, sheep and sofa). For the MS COCO data set, the 20 categories shared with the PASCAL VOC data set are the few-sample categories, and the remaining 60 categories are large-sample categories. During training, the present invention uses stochastic gradient descent as the optimization method of the network model, the initial learning rate is $1\times10^{-3}$, and the mini-batch size is set to 16 on the different data sets. For both data sets, the number of epochs for training from scratch and for fine-tuning is 300, and the CosineLR learning rate schedule (from 0.001 to 0.00001) is used during training. During prediction, the length and width of the input image are fixed at 448×448.
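The three PASCAL VOC splits can be written down as a small configuration. The class identifiers below use the usual PASCAL VOC spellings (for example "aeroplane" and "motorbike"), which is an assumed mapping from the category names in the text; the variable and function names are hypothetical.

```python
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

# Few-sample categories of the three splits described in the text.
FEW_SHOT_SPLITS = {
    1: ["bird", "bus", "cow", "motorbike", "sofa"],
    2: ["aeroplane", "bottle", "cow", "horse", "sofa"],
    3: ["boat", "cat", "motorbike", "sheep", "sofa"],
}


def large_sample_classes(split_id):
    """Return the 15 large-sample categories for the given split."""
    few = set(FEW_SHOT_SPLITS[split_id])
    return [name for name in VOC_CLASSES if name not in few]
```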

Experimental results

In this example, the present invention is compared with various few-sample target detection models proposed in recent years in terms of detection accuracy and detection speed on the PASCAL VOC 2007 and MS COCO 2014 data sets. Specifically, the detection model of the present invention is evaluated on these challenging data sets according to the evaluation criteria specified for PASCAL VOC and MS COCO. Both benchmarks contain training sets, validation sets and test sets. The PASCAL VOC 2007 data set contains 20 target categories, and the MS COCO 2014 data set contains 80 categories. For the former, the PASCAL VOC 2007 and PASCAL VOC 2012 training sets and validation sets are first merged into one set for training the detection model, and the PASCAL VOC 2007 test set is used for testing. Detection accuracy and speed are represented by the mean Average Precision (mAP) at an Intersection over Union (IoU) threshold of 0.5 (i.e. mAP@50) and the mean Frames Per Second (mFPS) over multiple different few-sample sets. For the latter, only the MS COCO 2014 training set is used for training, and its validation set is used for evaluation in the test phase; detection accuracy and speed are represented by the mAP averaged over IoU thresholds from 0.5 to 0.95 with an interval of 0.05 (i.e. AP) and Frames Per Second (FPS).
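The speed metrics can be sketched as below, assuming that FPS is the reciprocal of the per-image latency (time spent waiting for a result plus time spent post-processing it) and that mFPS is the plain average of the FPS values measured on the individual few-sample sets. The function names and the example timings are hypothetical and are not measured results.

```python
def fps(inference_seconds, postprocess_seconds):
    """Frames per second from per-image latency (assumed reciprocal form)."""
    return 1.0 / (inference_seconds + postprocess_seconds)


def mean_fps(fps_per_split):
    """mFPS: average FPS over the different few-sample sets."""
    return sum(fps_per_split) / len(fps_per_split)


# Usage with made-up timings for three few-sample sets.
per_split = [fps(0.020, 0.005), fps(0.015, 0.005), fps(0.0125, 0.0075)]
m_fps = mean_fps(per_split)
```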

Table 1

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions of the technical solutions of the present invention that do not depart from the purpose and scope of the technical solutions shall be included in the scope of the claims of the present invention.

Claims

1. A real-time detection method of few-sample targets based on a transfer learning strategy, characterized by comprising the following steps:

S1: Build a detection network model;

S2: Preprocess the input data;

S3: Train the target detection model from scratch on large sample category data;

S4: Fine-tune the few-sample category detection branch on the few-sample category data; use a new regularization method to guide the model to focus on the overall characteristics of the object during fine-tuning;

S5: Train the detection model through the training set, and then test it on the test set.
2. The real-time detection method of few-sample targets based on a transfer learning strategy according to claim 1, characterized in that the detection network model includes: a backbone network, which is Darknet-53 combined with a spatial pyramid pooling layer and is used to extract image features; a detection neck network, which consists of a feature pyramid network and is used to provide semantic features of different scales to the detection head network; and a detection head network, which is a dual-path detection branch structure with a discriminator, in which the large-sample category detection branch is used only to detect targets of the large-sample categories, the few-sample category detection branch is used to detect targets of all categories, and the discriminator scans the results of the two branches in sequence and obtains the final output according to a measurement criterion.
3. The real-time detection method of few-sample targets based on a transfer learning strategy according to claim 1, characterized in that the preprocessing described in step S2 is: processing the limited data by using random affine transformation, a multi-scale image training strategy, the MixUp data fusion strategy and the Label Smoothing label processing strategy.
4. The real-time detection method of few-sample targets based on a transfer learning strategy according to claim 2, characterized in that in step S3, the backbone network is initialized with the weights trained on the ImageNet data set, and the network model except the few-sample detection branch is trained from scratch using large-sample category data. The loss function at this stage involves the prediction box coordinates, target confidence and classification results, and is:
$$L_{\text{base training}} = L_{box} + L_{cls} + L_{obj} \tag{1}$$

where $L_{box}$ is the sum of the GIoU loss and the smooth L1 loss for coordinate regression, and $L_{cls}$ and $L_{obj}$ are the Focal Loss function and the binary cross-entropy loss function, respectively.
5. The real-time detection method of few-sample targets based on a transfer learning strategy according to claim 2, characterized in that in step S4, the model parameters of the backbone of the detection model, the detection neck and the large-sample category detection branch are frozen, and only the few-sample category detection branch is fine-tuned. The loss function at this stage involves the prediction box coordinates, target confidence, classification results and the difference from the large-sample category detection branch.
6. The real-time detection method of few-sample targets based on a transfer learning strategy according to claim 5, characterized in that step S4 specifically includes the following steps:

S41: Establish a base class distillation loss $L_b$ between the large-sample category detection branch and the few-sample detection branch. The calculation formula is as follows:

where N represents the batch size, l represents the absolute error function, and its two arguments are the outputs for the i-th image of the large-sample detection branch and the few-sample category detection branch, respectively;

S42: The loss function for fine-tuning on few samples is:
$$L_{\text{few-shot tuning}} = L_{box} + 2L_{cls} + L_{obj} + \lambda \cdot L_{b} \tag{3}$$

where λ controls the influence of the base class distillation loss on the gradient update of the model;

S43: Add a discriminator after the large-sample category detection branch and the few-sample detection branch. The discriminator takes the maximum of the large-sample category detection branch result and the few-sample category detection branch result as the final output, and its measurement criterion is as follows:

where $O_d(i, j)$ represents the discriminator output at a specific spatial grid cell.
7. The real-time detection method of few-sample targets based on a transfer learning strategy according to claim 1, characterized in that the new regularization method is the Attentive DropBlock algorithm, which has a dynamic coefficient γ, as shown below:

where the parameters keep_prob and block_size affect how often and over how large a range the feature map is set to zero, σ represents the sigmoid function, which is used to control the response range, and α represents the response amplification factor.
8. The real-time detection method of few-sample targets based on a transfer learning strategy according to claim 7, characterized in that the Attentive DropBlock algorithm first determines whether the model is currently in the fine-tuning stage and, if so, obtains the channel response $f_C$ and the spatial response $f_S$ of the few-sample category detection branch; then, after the parameter γ is calculated from the parameters keep_prob, block_size and α, spatial positions of each channel feature are set to zero according to a Bernoulli distribution with parameter γ; finally, with each zeroed position as the center, a mask block whose length and width are block_size is constructed, thereby regularizing the model.
9. The real-time detection method of few-sample targets based on a transfer learning strategy according to claim 1, characterized in that in step S5, training and testing are performed on the PASCAL VOC and MS COCO data sets;

For the PASCAL VOC data set, the training set and the validation set are first merged into one set for training the detection model, and then its test set is used for testing. Detection accuracy and speed are represented by the mean average precision at an intersection-over-union threshold of 0.5 and the mean number of frames processed per second over multiple different few-sample sets;

For the MS COCO data set, only its training set is used for training, and its validation set is used for evaluation. Detection accuracy and speed are represented by the mAP averaged over IoU thresholds from 0.5 to 0.95 with an interval of 0.05 and the number of frames processed per second.
10. The real-time detection method of few-sample targets based on a transfer learning strategy according to claim 9, characterized in that during the training process of step S5, stochastic gradient descent is used as the optimization method of the network model, the initial learning rate is $1\times10^{-3}$, and the mini-batch size is set to 16 on the different data sets; for the PASCAL VOC and MS COCO data sets, the number of epochs for training the detection model from scratch and for fine-tuning it is 300, and the CosineLR learning rate schedule is used during training, that is, the learning rate ranges from 0.001 to 0.00001; during prediction, the length and width of the input image are fixed at 448×448; FPS is computed from the sum of the time spent waiting for each result and the time spent post-processing the results, and mFPS is the average FPS over the different few-sample sets.