CN113159116A - Small sample image target detection method based on class interval balance - Google Patents

Small sample image target detection method based on class interval balance

Info

Publication number
CN113159116A
CN113159116A
Authority
CN
China
Prior art keywords
class
small sample
training
data
sample image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110262177.9A
Other languages
Chinese (zh)
Inventor
Ye Qixiang (叶齐祥)
Li Bohao (李博豪)
Jiao Jianbin (焦建彬)
Yang Boyu (杨博宇)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Chinese Academy of Sciences
Original Assignee
University of Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Chinese Academy of Sciences filed Critical University of Chinese Academy of Sciences
Priority to CN202110262177.9A priority Critical patent/CN113159116A/en
Publication of CN113159116A publication Critical patent/CN113159116A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a small sample image target detection method based on class margin equilibrium, which comprises a training stage and a testing stage, wherein the training stage comprises the following steps: step 1, training a network model by using base class data; and step 2, carrying out fine-tuning training on the network model by using a combination of novel class data and base class data. The disclosed method does not require large amounts of data annotation, which reduces manual annotation cost; it reduces aliasing between prototype vectors of different classes and improves the target detection accuracy of the neural network on query images. The method is of significance for small sample learning, incremental learning and the like, and has application value for target detection in natural scene images and other fields.

Description

Small sample image target detection method based on class interval balance
Technical Field
The invention belongs to the technical field of small sample learning and computer vision, and particularly relates to a small sample image target detection method based on class margin equilibrium.
Background
In recent years, visual object detection has made great progress, mainly due to the availability of large-scale datasets with precise annotations and of convolutional neural networks (CNNs) capable of absorbing that annotation information. However, annotating a large number of objects is expensive and laborious, and it is inconsistent with cognitive learning, which can build accurate models with very little supervision.
Small sample object detection, which mimics the way humans learn, has attracted increasing attention. Given base classes with sufficient training data and novel classes with only a few supervised samples, small sample target detection trains a model to detect objects of both the base classes and the novel classes. To this end, most works divide the training process into two phases: base class training (representation learning) and novel class reconstruction (meta-training). In representation learning, sufficient base class training data are used to train the network and construct a representative feature space; in meta-training, the network is fine-tuned so that novel class objects can be represented within that feature space.
While small sample target detection methods have made significant advances, previous work has ignored the inherent contradiction between representation and classification. Namely: to separate classes and reduce aliasing between them, the distributions of base classes need to be far from each other (maximum margin); but to represent novel classes accurately, the distributions of base classes should be close to each other (minimum margin), which makes classification difficult.
Therefore, how to optimize the representation and classification of novel classes in the same feature space is a problem that needs to be solved.
Disclosure of Invention
In order to overcome the above problems, the present inventors have conducted intensive studies to design a small sample image object detection method based on class margin equilibrium, which constructs a feature space during base class training and maximizes the margin between novel classes by introducing a class margin loss; during novel class reconstruction, a feature disturbance module is introduced by truncating the gradient map; through multiple training iterations, feature reconstruction and object classification are simultaneously achieved in the same feature space by equalizing the class margins in an adversarial min-max manner, thereby completing the present invention.
Specifically, the present invention aims to provide the following:
in a first aspect, a small sample image target detection method based on class margin equilibrium is provided; the method includes a training phase and a testing phase, and the training phase includes the following steps:
step 1, training a network model by using base class data;
and step 2, carrying out fine-tuning training on the network model by using the combination of the novel class data and the base class data.
In a second aspect, a computer-readable storage medium is provided, which stores a small sample image target detection training program based on class margin equilibrium; when executed by a processor, the program causes the processor to execute the steps of the small sample image target detection method based on class margin equilibrium according to the first aspect.
In a third aspect, a computer device is provided, which includes a memory and a processor; the memory stores a small sample image target detection training program based on class margin equilibrium, and the program, when executed by the processor, causes the processor to execute the steps of the small sample image target detection method based on class margin equilibrium of the first aspect.
The invention has the advantages that:
(1) the invention provides a small sample image target detection method based on class margin equilibrium, which reveals the contradiction between feature representation and classification hidden in small sample target detection, provides a feasible method for alleviating this contradiction from the perspective of class margin equilibrium (CME), and realizes feature reconstruction and target classification in the same feature space;
(2) the small sample image target detection method based on class margin equilibrium provided by the invention proposes a max-margin loss and a feature disturbance module, realizes class margin equilibrium in an adversarial min-max manner, reduces aliasing between prototype vectors of different classes, and improves the target detection accuracy of the neural network on the query image;
(3) the small sample image target detection method based on class margin equilibrium converts the small sample target detection problem into a small sample classification problem by filtering the localization features out of the prototype vector, thereby reducing manual annotation cost; it is of great significance for small sample learning, incremental learning and the like, and has application value for target detection in natural scene images and other fields.
Drawings
FIG. 1 illustrates a flow diagram for training a network model using a large amount of base class data in accordance with a preferred embodiment of the present invention;
FIG. 2 illustrates a fine tuning training flow diagram in accordance with a preferred embodiment of the present invention;
FIG. 3 illustrates t-SNE visualizations of the prototype vectors generated by the baseline method and by the method of the invention in the examples, wherein the novel classes are shown in bold;
FIG. 4 shows the t-SNE visualization of the evolution of the prototype vector features of two classes in the fine-tuning phase in an example, where the dashed line represents the feature disturbance path (forward propagation) and the solid line segment represents the path of the max-margin loss (backward propagation);
FIG. 5 is a comparison of the detection results of the baseline method and the method of the present invention, wherein green boxes are correct detections, blue boxes are detections missed relative to the ground truth, and red boxes are erroneous detections.
Detailed Description
The present invention will be described in further detail below with reference to preferred embodiments and examples. The features and advantages of the present invention will become more apparent from the description.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The inventors found that, in the small sample image target detection method, a feature space can be constructed during base class training with the margin between novel classes maximized by introducing a class margin loss, and a feature disturbance module can be introduced during novel class reconstruction. The feature space partition for small sample target detection is optimized through adversarial class margin regularization, so that feature reconstruction and target classification are achieved in the same feature space.
Based on the above, the first aspect of the present invention provides a small sample image target detection method based on class margin equilibrium (CME); the method includes a training phase and a testing phase, and the training phase includes the following steps, as shown in FIGS. 1 and 2:
step 1, training a network model by using base class data;
and step 2, carrying out fine-tuning training on the network model by using the combination of the novel class data and the base class data.
In the invention, the training phase trains the neural network by detecting the targets of the query set using the support set, given a support set (support images) and a query set (query images) with the same categories.
The support set (support images) refers to images with semantic annotations, the query set (query images) refers to unannotated images to be detected, and the same category means the same semantic category, such as sheep, cattle and the like.
The steps involved in the training phase are described further below:
step 1, training a network model by using base class data.
As shown in FIG. 1, the network model is trained using a large amount of base class data.
Preferably, step 1 comprises the sub-steps of:
step 1-1, extracting a feature map of the query image.
According to a preferred embodiment of the invention, a convolutional neural network base network is used to extract a feature map from a query image.
The base network can be YOLO, Darknet-19, or the like, preferably Darknet-19. For example, when the base network for extracting the query image features is Darknet-19, the network input is a 3-channel image of size 416 × 416 and the network output is a 13 × 13 × 1024 feature map.
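For reference, the 416 → 13 spatial reduction corresponds to a 32× downsampling backbone. The following minimal PyTorch sketch is an illustrative stand-in with the same input/output shapes, not the actual Darknet-19 definition:

```python
import torch
import torch.nn as nn

# Five stride-2 stages give the 32x downsampling of a Darknet-19-style backbone,
# mapping a (3, 416, 416) query image to a (1024, 13, 13) feature map.
backbone = nn.Sequential(*[
    nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                  nn.LeakyReLU(0.1))
    for c_in, c_out in [(3, 32), (32, 64), (64, 256), (256, 512), (512, 1024)]
])

feat = backbone(torch.randn(1, 3, 416, 416))
print(feat.shape)  # torch.Size([1, 1024, 13, 13])
```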
Step 1-2, extracting a feature map of the support image and its corresponding prototype vector.
According to a preferred embodiment of the invention, the feature map of the annotated support image is extracted using a lightweight convolutional neural network, and the prototype vector corresponding to the support image feature map is extracted using max pooling.
The lightweight convolutional neural network refers to a network whose computation is designed to be more efficient, reducing the number of network parameters without losing performance; it can be a Darknet network, a MobileNet network, or the like.
In a further preferred embodiment, the prototype vector corresponding to the feature map of the support image is obtained by the following formula:

$\mu_i^k = \mathrm{MaxPool}\big(f_{\theta_S}(I_S^{i,k} \oplus M_S^{i,k})\big)$

wherein $f_{\theta_S}(\cdot)$ is the lightweight convolutional neural network that extracts the features of the support image, $I_S^{i,k}$ is a support image, $M_S^{i,k}$ is the mask of the target in that image, and $\oplus$ is the embedding operation: $I_S$ and $M_S$ are embedded to generate a four-channel image to be input to the neural network.

The support data set S is composed of the pairs $\{(I_S^{i,k}, M_S^{i,k})\}$, wherein i is the index corresponding to the category, k is the index corresponding to the target within the category, $I_S^{i,k}$ is a three-channel image of size W × H, and $M_S^{i,k}$ is the corresponding single-channel mask map of size W × H; W and H are preferably 416.
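A minimal PyTorch sketch of this prototype extraction follows; it assumes that ⊕ concatenates the image and its mask into a four-channel input, and the `SupportCNN` below is an illustrative stand-in for the lightweight network $f_{\theta_S}$, not the patented architecture:

```python
import torch
import torch.nn as nn

class SupportCNN(nn.Module):
    """Illustrative stand-in for f_theta_S over the 4-channel (image + mask) input."""
    def __init__(self, out_dim=1024):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(256, out_dim, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return self.features(x)

def extract_prototype(f_theta_s, image, mask):
    """image: (B, 3, 416, 416); mask: (B, 1, 416, 416) binary object mask."""
    x = torch.cat([image, mask], dim=1)        # embedding I_S (+) M_S -> 4-channel input
    feat = f_theta_s(x)                        # (B, C, h, w) support feature map
    return feat.flatten(2).max(dim=2).values   # global max pooling -> (B, C) prototypes

# usage
f = SupportCNN()
img = torch.randn(2, 3, 416, 416)
msk = torch.randint(0, 2, (2, 1, 416, 416)).float()
mu = extract_prototype(f, img, msk)            # (2, 1024) prototype vectors
```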
Step 1-3, performing dimension reduction on the prototype vector to obtain the reduced prototype vector, and obtaining the loss value corresponding to the reduced prototype vector.
The inventors found through research that the prototype vector obtained in step 1-2 contains both the class features and the localization features of the target in the support image, which is the main cause of aliasing between prototype vectors of different classes. Therefore, according to a preferred embodiment of the invention, a fully connected convolution layer is used to reduce the dimension of the prototype vector, converting the target detection problem into a classification problem.
Preferably, the prototype vector is reduced in dimension by:

$\tilde{\mu}_i^k = f_{\mathrm{FC}}(\mu_i^k)$

wherein $f_{\mathrm{FC}}(\cdot)$ is a fully connected convolution layer.
In a further preferred embodiment, the loss value corresponding to the reduced prototype vector is obtained through the max-margin loss.
Preferably, the loss value corresponding to the reduced prototype vector is obtained by the following formula:

$\mathcal{L}_{margin} = \sum_i d_i^{intra} / d_i^{inter}$

wherein $\mathcal{L}_{margin}$ is the max-margin loss; $d_i^{intra}$ is the intra-class distance,

$d_i^{intra} = \frac{1}{K} \sum_{j=1}^{K} \|\tilde{\mu}_i^j - \mu_i'\|_2;$

$d_i^{inter}$ is the inter-class distance,

$d_i^{inter} = \min_{l \neq i} \|\mu_i' - \mu_l'\|_2;$

and $\mu_i'$ is the class-mean prototype vector,

$\mu_i' = \frac{1}{K} \sum_{j=1}^{K} \tilde{\mu}_i^j,$

wherein K is the number of prototype vectors corresponding to a class and j is the index of a prototype vector within the class.
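Under this reconstruction, the max-margin loss can be sketched as below. The Euclidean metric and the sum-of-ratios reduction are assumptions made for illustration, since the exact formula is rendered as an image in the original publication:

```python
import torch

def max_margin_loss(protos, eps=1e-8):
    """protos: (N, K, D) -- K reduced prototype vectors for each of N classes.
    Minimizing the ratio shrinks intra-class distances and grows inter-class ones."""
    class_means = protos.mean(dim=1)                        # mu'_i: (N, D)
    # intra-class distance: mean distance of each prototype to its class mean
    d_intra = (protos - class_means.unsqueeze(1)).norm(dim=2).mean(dim=1)  # (N,)
    # inter-class distance: distance of each class mean to the nearest other class mean
    pair = torch.cdist(class_means, class_means)            # (N, N) pairwise distances
    eye = torch.eye(pair.size(0), dtype=torch.bool, device=pair.device)
    d_inter = pair.masked_fill(eye, float("inf")).min(dim=1).values        # (N,)
    return (d_intra / (d_inter + eps)).sum()

loss = max_margin_loss(torch.randn(15, 5, 512))  # e.g. 15 base classes, K = 5
```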
In the present invention, it is preferable to decouple, by introducing the fully connected convolution layer, local features that may mislead the class margin in the feature space.
The method converts the small sample target detection problem into a small sample classification problem by filtering the localization features out of the prototype vector, which reduces manual annotation cost and is of significance for small sample learning, incremental learning and the like.
Step 1-4, performing feature activation on the query image to realize target detection on the query image.
Specifically, the prototype vector before dimension reduction is used to perform feature activation on the unannotated query image to be detected, realizing target detection on the query image.
Feature activation can be performed by methods commonly used in the prior art; preferably, the prototype vector before dimension reduction is multiplied channel by channel with the extracted features of the query image, so that the prototype vector and the features are fused to complete the feature activation.
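A minimal sketch of this channel-by-channel fusion (shapes follow the 13 × 13 × 1024 feature map above):

```python
import torch

def activate(query_feat, proto):
    """query_feat: (B, C, H, W) query feature map; proto: (C,) prototype vector.
    Channel-by-channel multiplication fuses the prototype into the query features."""
    return query_feat * proto.view(1, -1, 1, 1)

fq = torch.randn(1, 1024, 13, 13)
mu = torch.randn(1024)
activated = activate(fq, mu)  # (1, 1024, 13, 13), fed to the prediction head
```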
Step 1-5, updating the parameters of the network model.
According to a preferred embodiment of the present invention, the detection loss between the detection result and the annotation is calculated, the gradient of the loss function is computed, the error gradient is back-propagated through the network, and the network parameters are updated.
In a further preferred embodiment, the detection loss between the detection result and the annotation is obtained by the following formula:

$\mathcal{L}_{det} = \mathcal{L}\big(f_{\theta_P}(F_{\theta_Q}(I_Q) \otimes \mu_i),\ M_Q\big)$

wherein $F_{\theta_Q}(I_Q)$ represents the feature map extracted from the query image $I_Q$, $\theta_Q$ represents the network parameters used for query feature extraction, $\otimes$ denotes channel-by-channel multiplication, $f_{\theta_P}$ represents the prediction model, $\theta_P$ represents the parameters of the prediction model, $\mu_i$ represents the prototype vector corresponding to the support image, and $M_Q$ represents the annotation of the corresponding query picture.

In the invention, the output of the prediction model on the activated query feature map, $f_{\theta_P}(F_{\theta_Q}(I_Q) \otimes \mu_i)$, is the network's detection result for the query image.
In a still further preferred embodiment, the detection loss function is obtained by the following formula:

$\mathcal{L}_{det} = \mathcal{L}_{cls} + \mathcal{L}_{reg} + \mathcal{L}_{obj}$

wherein $\mathcal{L}_{cls}$ is the classification loss function, $\mathcal{L}_{reg}$ is the regression loss function, and $\mathcal{L}_{obj}$ is the anchor confidence loss function.
$\mathcal{L}_{cls} = \lambda_{class} \sum \big(\mathbb{1}_i^{obj} - \hat{p}_i\big)^2$

wherein $\mathbb{1}_i^{obj}$ is an indicator function determining whether the target in the current detection box is of class i, $\hat{p}_i$ is the predicted confidence for the corresponding class i, and $\lambda_{class}$ is the classification loss weight.
$\mathcal{L}_{reg} = \lambda_{coord} \sum_{cells} \sum_{n=1}^{l.n} \mathbb{1}^{obj}\big[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2\big]$

wherein $\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i$ are the upper-left coordinates, width and height of the detection box predicted by the network, respectively; $x_i, y_i, w_i, h_i$ are the upper-left coordinates, width and height of the corresponding ground-truth box; $\lambda_{coord}$ is the regression loss weight; l.w and l.h are the width and height of the final output feature map, the outer sum running over its l.w × l.h cells; and l.n is the number of detection boxes corresponding to each cell of the final output feature map.
$\mathcal{L}_{obj} = \lambda_{noobj} \sum \mathbb{1}_i^{noobj}\big(C_i - \hat{C}_i\big)^2 + \lambda_{obj} \sum \mathbb{1}_i^{obj}\big(C_i - \hat{C}_i\big)^2$

wherein $\lambda_{noobj}$ is the weight in the anchor confidence loss for anchors that do not contain a target and $\lambda_{obj}$ is the weight for anchors that do; $\mathbb{1}_i^{noobj}$ is an indicator function determining whether the current detection box does not contain a target of class i; $C_i$ is the anchor confidence output by the network; and $\hat{C}_i$ is the true confidence of the anchor.
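As an illustration, the anchor confidence term, with separate weights for anchors that do and do not contain a target, can be sketched as follows; the weight values are placeholders, as the patent text does not publish them:

```python
import torch

def confidence_loss(pred_conf, true_conf, obj_mask,
                    lambda_obj=5.0, lambda_noobj=1.0):
    """pred_conf, true_conf: (B, A, H, W) predicted and target anchor confidences;
    obj_mask: boolean tensor marking anchors assigned to a ground-truth object."""
    sq_err = (pred_conf - true_conf) ** 2
    return (lambda_obj * sq_err[obj_mask].sum()
            + lambda_noobj * sq_err[~obj_mask].sum())
```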
Step 2, carrying out fine-tuning training on the network model by using the combination of the novel class data and the base class data.
As shown in FIG. 2, the network model is fine-tuned using a small amount of data containing the novel classes in combination with base class data.
Preferably, step 2 comprises the following sub-steps:
and 2-1, combining the novel class data and the base class data to form a new data set, then training a network model, and repeating the steps 1-1 to 1-5.
Because the novel class data and the base class data need to be balanced in quantity to obtain a better training effect, the combination of the novel class data and the base class data means: when there are only N novel class targets, the number of base class targets provided is also N.
Step 2-2, obtaining new training data.
Wherein, the step 2-2 comprises the following substeps:
and 2-2-1, obtaining a gradient map corresponding to the support image with the label according to the neural network back transmission.
According to a preferred embodiment of the present invention, the gradient map corresponding to the annotated support image is obtained by:

$G(x, y) = \partial \tilde{\mathcal{L}} / \partial I_S(x, y)$

wherein $\tilde{\mathcal{L}}$ represents the loss function in fine-tuning training, which includes the max-margin loss $\mathcal{L}_{margin}$; the gradient map G(x, y) has size W × H, the same as I(x, y).
Step 2-2-2, performing binarization processing on the gradient map and multiplying it by the mask map corresponding to the target in the support image to obtain a new mask map, which serves as new training data.
According to a preferred embodiment of the invention, the new mask map is obtained by:

$M_S'(x, y) = M_S(x, y) \cdot T(G(x, y))$

wherein $M_S(x, y)$ denotes the original mask map and $T(G(x, y))$ denotes the gradient map after binarization processing:

$T(G(x, y)) = \begin{cases} 0, & G(x, y) \geq \tau \\ 1, & \text{otherwise} \end{cases}$

where τ is a dynamic threshold, generally set as follows: the gradient map is divided into two parts, the region covered by the mask map and the region outside it; within the mask region, the mask values of the pixels whose gradient values rank in the top 15% of that region are set to 0 (originally 1), while the region outside the mask is left unchanged.
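A sketch of this gradient-truncation step, assuming G(x, y) is taken as the per-pixel gradient magnitude of the fine-tuning loss with respect to the support image:

```python
import torch

def disturbed_mask(loss, support_img, mask, top_frac=0.15):
    """loss: scalar fine-tuning loss that depends on support_img;
    support_img: (1, 3, H, W) tensor created with requires_grad=True;
    mask: (1, 1, H, W) binary object mask. Returns M_S' = M_S * T(G)."""
    grad, = torch.autograd.grad(loss, support_img, retain_graph=True)
    g = grad.abs().sum(dim=1, keepdim=True)   # per-pixel gradient magnitude G(x, y)
    g_in = g[mask.bool()]                     # gradient values inside the mask region
    k = max(1, int(top_frac * g_in.numel()))
    tau = g_in.topk(k).values.min()           # dynamic threshold: top 15% inside the mask
    keep = ~((g >= tau) & mask.bool())        # T(G): zero out the highest-gradient mask pixels
    return mask * keep.float()
```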
Step 2-3, training the network model with the new training data and repeating steps 1-1 to 1-5.
In the invention, during base class training, a feature space is constructed and the margin between novel classes is maximized by introducing the class margin loss; during novel class reconstruction, a feature disturbance module is introduced by truncating the gradient map. Through multiple training iterations, the class margins are equalized in an adversarial min-max manner, so that the class margins are regularized, and feature reconstruction and target classification are simultaneously realized in the same feature space.
According to a preferred embodiment of the present invention, the testing stage applies the trained network model to a small sample target detection task on the novel classes to verify the validity of the model.
In a further preferred embodiment, the data class of the testing phase is the same as the data class of the training phase, but the pictures are different.
In a still further preferred embodiment, the number of support images in each novel class of data (new category) is one (1-shot) or several (few-shot);
preferably, when there are multiple support images, the prototype vectors extracted for each class are averaged to obtain the corresponding class prototype vector for target detection.
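A minimal sketch of this prototype averaging:

```python
import torch

def class_prototype(shot_protos):
    """shot_protos: (K, D) prototypes extracted from the K support images of one
    class; the class prototype is their element-wise mean."""
    return shot_protos.mean(dim=0)

mu_class = class_prototype(torch.randn(5, 1024))  # 5-shot -> one (1024,) vector
```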
The small sample image target detection method based on class margin equilibrium establishes a feature disturbance module for the small number of annotated support images, so as to augment the data of the mask maps corresponding to the targets in the images; it extracts prototype vectors from the images through a convolutional neural network and uses the prototype vectors to perform feature activation on the unannotated query images to be detected, realizing target detection on the query images. The method does not require large amounts of data annotation, reducing manual annotation cost; it reduces aliasing between prototype vectors of different classes and improves the detection accuracy of the neural network on query images; and it is of significance for small sample learning, incremental learning and the like, and has application value for target detection in natural scene images and other fields.
The invention also provides a computer-readable storage medium storing a small sample image target detection training program based on class margin equilibrium; when executed by a processor, the program causes the processor to execute the steps of the small sample image target detection method based on class margin equilibrium.
The small sample image target detection method based on class margin equilibrium can be realized by means of software plus a necessary general hardware platform, wherein the software is stored in a computer-readable storage medium (including a ROM/RAM, a magnetic disk or an optical disk) and comprises instructions for enabling a terminal device (which can be a mobile phone, a computer, a server, a network device or the like) to execute the method.
The invention also provides a computer device, which comprises a memory and a processor; the memory stores a small sample image target detection training program based on class margin equilibrium, and the program, when executed by the processor, causes the processor to execute the steps of the small sample image target detection method based on class margin equilibrium.
Examples
The present invention is further described below by way of specific examples, which are merely exemplary and do not limit the scope of the present invention in any way.
Example 1
1. Data set
This example was evaluated on the commonly used Pascal VOC 2007, Pascal VOC 2012 and MS COCO datasets.
The object classes in the dataset are divided into two types: base classes with a large number of annotations and novel classes with K annotated instances (K-shot). When training the model with a large amount of base class data, the network is optimized using the base class training data; during fine-tuning, the network is optimized using K instances of each novel class and each base class.
For the Pascal VOC dataset, the whole dataset is divided into 3 groups for cross validation; in each group, 5 classes are selected as novel classes and the other 15 classes serve as base classes, and the number K of annotated instances is set to 1, 2, 3, 5 and 10. For the MS COCO dataset, 20 classes are selected as novel classes and the remaining 60 classes are set as base classes.
2. Task description
In the small sample target detection process, after network feature representation learning has been completed using the training set data, target detection on query images is realized using a small number of annotated images from the test set, i.e., the support set; after testing, performance is evaluated using mAP.
For mAP, see "Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.".
Specifically, the training phase:
(1) Extracting the feature map of the query image: the base network for extracting the query image features is Darknet-19; the network input is a 3-channel image of size 416 × 416 and the network output is a 13 × 13 × 1024 feature map;
(2) Extracting the feature map of the support image and the corresponding prototype vector: the prototype vector of a support image is

$\mu_i^k = \mathrm{MaxPool}\big(f_{\theta_S}(I_S^{i,k} \oplus M_S^{i,k})\big)$

wherein $f_{\theta_S}(\cdot)$ is the lightweight convolutional neural network that extracts the features of the support image, $I_S^{i,k}$ is a support image and $M_S^{i,k}$ is the mask of the target in the image; the support data set S is composed of $\{(I_S^{i,k}, M_S^{i,k})\}$, wherein i is the index corresponding to the category, k is the index corresponding to the target within the category, $I_S^{i,k}$ is a three-channel image of size W × H, and $M_S^{i,k}$ is the corresponding single-channel mask map of size W × H; $\oplus$ is the embedding operation, embedding $I_S$ and $M_S$ to generate a four-channel image input to the neural network, with W and H both 416;
(3) Reducing the dimension of the prototype vector and obtaining the loss value after dimension reduction: the target detection problem is converted into a classification problem through

$\tilde{\mu}_i^k = f_{\mathrm{FC}}(\mu_i^k)$

wherein $f_{\mathrm{FC}}(\cdot)$ is a fully connected layer;

the max-margin loss is $\mathcal{L}_{margin} = \sum_i d_i^{intra} / d_i^{inter}$, wherein the class-mean prototype vector is $\mu_i' = \frac{1}{K}\sum_{j=1}^{K}\tilde{\mu}_i^j$, the intra-class distance is $d_i^{intra} = \frac{1}{K}\sum_{j=1}^{K}\|\tilde{\mu}_i^j - \mu_i'\|_2$, and the inter-class distance is $d_i^{inter} = \min_{l \neq i}\|\mu_i' - \mu_l'\|_2$; the max-margin loss can be calculated from the above equations.
(4) Performing feature activation on the query image to realize target detection on the query image.
(5) Calculating the detection loss between the detection result and the annotation, computing the gradient of the loss function, back-propagating the error gradient through the network, and updating the network parameters.
The detection loss between the detection result and the annotation is obtained by the following formula:

$\mathcal{L}_{det} = \mathcal{L}\big(f_{\theta_P}(F_{\theta_Q}(I_Q) \otimes \mu_i),\ M_Q\big)$

wherein $F_{\theta_Q}(I_Q)$ represents the feature map extracted from the query image $I_Q$, $\theta_Q$ represents the network parameters used for query feature extraction, $\otimes$ denotes channel-by-channel multiplication, $f_{\theta_P}$ represents the prediction model, $\theta_P$ represents the parameters of the prediction model, $\mu_i$ represents the prototype vector corresponding to the support image, and $M_Q$ represents the annotation of the corresponding query picture;

the detection loss function is

$\mathcal{L}_{det} = \mathcal{L}_{cls} + \mathcal{L}_{reg} + \mathcal{L}_{obj}$

wherein $\mathcal{L}_{cls}$ is the classification loss function, $\mathcal{L}_{reg}$ is the regression loss function, and $\mathcal{L}_{obj}$ is the anchor confidence loss function.
(6) Training the network model by combining the novel class data and the base class data, and repeating steps (1) to (5).
(7) Obtaining new training data:
The gradient map corresponding to the annotated support image is obtained by the following formula:

$G(x, y) = \partial \tilde{\mathcal{L}} / \partial I_S(x, y)$

wherein $\tilde{\mathcal{L}}$ represents the loss function in fine-tuning training, which includes the max-margin loss $\mathcal{L}_{margin}$; the gradient map G(x, y) has size W × H, the same as I(x, y);

the new mask map is $M_S'(x, y) = M_S(x, y) \cdot T(G(x, y))$, where $M_S(x, y)$ is the original mask map and

$T(G(x, y)) = \begin{cases} 0, & G(x, y) \geq \tau \\ 1, & \text{otherwise} \end{cases}$

where τ is a dynamic threshold, typically set as follows: the gradient map is divided into two parts, the mask region and the non-mask region; within the mask region, the mask values of the pixels whose gradient values rank in the top 15% are set to 0 (originally 1), while the non-mask region is left unchanged.
(8) Training the network model with the new training data, and repeating steps (1) to (5).
Testing stage: the number of support images in each novel class of data (new category) is one (1-shot) or several (few-shot); when there are multiple support images, the prototype vectors extracted for each class are averaged to obtain the corresponding class prototype vector for target detection.
3. Results and analysis
The invention uses Meta YOLO and MPSR as baselines and evaluates on the Pascal VOC and MS COCO datasets; the results are shown in Tables 1 and 2, respectively.
TABLE 1 Comparison of target detection test performance on the Pascal VOC dataset
[Table 1 content is rendered as an image in the original document.]
TABLE 2 Comparison of target detection test performance on the MS COCO dataset
[Table 2 content is rendered as an image in the original document.]
Here APS is the average precision for small targets, APM the average precision for medium targets, and APL the average precision for large targets; AR1 is the average recall with one proposal region per image, AR10 with ten proposal regions per image, and AR100 with one hundred proposal regions per image; ARS is the average recall for small targets, ARM for medium targets, and ARL for large targets.
Here LSTD, Meta YOLO, Meta R-CNN, MetaDet, TFA w/cos, Viewpoint and MPSR are state-of-the-art small sample target detection methods, and CME is the method provided by the invention.
LSTD is described in "Hao Chen, Yali Wang, Guoyou Wang, and Yu Qiao. LSTD: A low-shot transfer detector for object detection. In AAAI, pages 2836–2843, 2018.";
Meta YOLO is described in "Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. In IEEE ICCV, pages 8419–8428, 2019.";
MetaDet is described in "Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Meta-learning to detect rare objects. In IEEE ICCV, pages 9924–9933, 2019.";
Meta R-CNN is described in "Xiaopeng Yan, Ziliang Chen, Anni Xu, Xiaoxi Wang, Xiaodan Liang, and Liang Lin. Meta R-CNN: Towards general solver for instance-level low-shot learning. In IEEE ICCV, pages 9576–9585, 2019.";
TFA w/cos is described in "Xin Wang, Thomas E. Huang, Trevor Darrell, Joseph E. Gonzalez, and Fisher Yu. Frustratingly simple few-shot object detection. CoRR, abs/2003.06957, 2020.";
Viewpoint is described in "Yang Xiao and Renaud Marlet. Few-shot object detection and viewpoint estimation for objects in the wild. In ECCV, 2020.";
MPSR is described in "Jiaxi Wu, Songtao Liu, Di Huang, and Yunhong Wang. Multi-scale positive sample refinement for few-shot object detection. In ECCV, pages 456–472, 2020.".
In Table 1, CME is compared on the Pascal VOC dataset with one-stage small sample target detectors based on the YOLO detector, including LSTD, Meta YOLO and MetaDet. As can be seen from Table 1, the CME detector proposed by the present invention shows a clear advantage over the other detectors. Specifically, for group 1 (Novel Set 1), CME achieves a 0.7% improvement at the 1-shot setting (17.8% versus 17.1%), a 7.0% improvement at the 2-shot setting (26.1% versus 19.1%), and a 2.6% improvement at the 3-shot setting (31.5% versus 28.9%). The improvement at the 5-shot setting is 9.8% (44.8% versus 35.0%), and the average improvement is 3.8%, a large gain for the small sample target detection domain. The average performance improvement for group 3 is also 0.7%.
In addition, Table 1 compares the proposed method with two-stage small sample target detectors, including MetaDet, Meta R-CNN, TFA, Viewpoint and MPSR, all of which are based on the Faster R-CNN framework. As can be seen from Table 1, CME performs better than detectors of the same type in most cases. For group 1, CME achieves a 5% improvement at the 2-shot setting (47.5% versus 42.5%) and a 6% improvement at the 5-shot setting (58.2% versus 52.2%). The average improvement reaches 1.2%. In addition, the average performance improvement is 1.5% for group 2 and 0.3% for group 3.
The MS COCO dataset has more object classes and images than the Pascal VOC dataset, which means that class margin equilibrium may benefit from richer feature representations. As shown in Table 2, the tests on the MS COCO dataset achieve a more significant relative improvement: at the 10-shot setting, CME improves AP by 5.3% over the baseline method MPSR, and at the 30-shot setting, AP improves by 2.8%.
Further, FIGS. 3 and 5 show comparisons of the target detection behavior of the method of the present invention and the baseline methods.
FIG. 3 shows the t-SNE visualization results of the prototype vectors generated by the method of the invention and by the baseline method, with the novel classes displayed in bold.
FIG. 4 shows a t-SNE visualization of the evolution of the prototype vector features of two classes during fine-tuning training. The dashed line represents the feature disturbance path (forward propagation) and the solid line segment represents the path of the max-margin loss (backward propagation). In the fine-tuning phase, the instance features of the novel classes form subspaces in the feature space learned on the base classes.
FIG. 5 shows the target detection results of the method of the present invention and the baseline methods (Meta YOLO and MPSR), where red boxes indicate erroneous detections, blue boxes indicate missed detections, and green boxes indicate correct detections.
As can be seen from FIG. 5, the method of the present invention (CME) can accurately detect more objects with fewer erroneous detection results.
With the max-margin loss, the small sample object detector can reduce erroneous detection results, because the increased distance between classes helps the classifier distinguish them and obtain correct classification results. However, the max-margin loss alone does not facilitate feature reconstruction, as the number of missed targets increases. By introducing class margin equilibrium, the method of the invention balances the contradiction between classification and representation.
Further, ablation experiments were performed for each module of CME on the novel classes of Pascal VOC split 1, with the results shown in Table 3:
TABLE 3
[Table 3 content is rendered as an image in the original document.]
Here MM is Max-Margin (MM), FF is Feature Filtering (FF) (i.e., the prototype vector dimension reduction in step (3)), and FD is Feature Disturbance (FD); "√" indicates that the module is included; Avg Imp represents the average improvement in accuracy of the method of the invention over the 1–10 shot settings.
As can be seen from Table 3, all three modules improve the accuracy of the model's target detection in the method described in this embodiment.
Comparative experimental results for the number of output channels of the feature filtering module on the novel classes of Pascal VOC split 1 are shown in Table 4:
TABLE 4
[Table 4 content is rendered as an image in the original document.]
As can be seen from Table 4, the model performance improvement is highest when the number of channels of the prototype vector is reduced from 1024 dimensions to 512 dimensions.
The results of experiments on the novel classes of Pascal VOC split 1 comparing the classes on which the feature disturbance module acts are shown in Table 5:
TABLE 5
[Table 5 content is rendered as an image in the original document.]
Here C_Novel denotes applying the module to the novel classes and C_Base denotes applying it to the base classes.
As can be seen from Table 5, performance improves most when the feature disturbance module acts only on the base classes.
The results of comparison experiments with different disturbance strategies of the feature disturbance module on the novel classes of Pascal VOC split 1 are shown in Table 6:
TABLE 6
[Table 6 content is rendered as an image in the original document.]
Here Random denotes truncation at randomly selected sample points, Random crop denotes truncation of a randomly selected image region, the feature-value strategy truncates according to the feature map values, and Gradient denotes truncation according to the gradient map values.
As can be seen from Table 6, gradient truncation as the disturbance mode of the feature disturbance module yields the highest performance improvement for the model.
The invention has been described in detail with reference to preferred embodiments and illustrative examples, but the description is not to be construed as limiting the invention. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the technical solution of the invention and its embodiments without departing from the spirit and scope of the invention, all of which fall within the scope of the invention.

Claims (10)

1. A small sample image target detection method based on class margin equilibrium, characterized by comprising a training phase and a testing phase, wherein the training phase comprises the following steps:
step 1, training a network model by using base class data;
and step 2, carrying out fine-tuning training on the network model by using the combination of the novel class data and the base class data.
2. The small sample image target detection method based on class margin equilibrium according to claim 1, wherein step 1 comprises the following substeps:
step 1-1, extracting a feature map of a query image;
step 1-2, extracting a feature map of a support image and a prototype vector corresponding to the feature map;
step 1-3, performing dimensionality reduction on the prototype vector to obtain a dimensionality-reduced prototype vector and obtain a loss value corresponding to the dimensionality-reduced prototype vector;
step 1-4, performing feature activation on the query image to realize target detection on the query image;
and step 1-5, updating the parameters of the network model.
3. The small sample image target detection method based on class margin equilibrium according to claim 2, wherein in step 1-2, the prototype vector corresponding to the feature map of the support image is obtained by the following formula:
$\mu_i^k = \mathrm{MaxPool}\big(f_{\theta_S}(I_S^{i,k} \oplus M_S^{i,k})\big)$

wherein $f_{\theta_S}(\cdot)$ is a lightweight convolutional neural network that extracts the features of the support image, $I_S^{i,k}$ is a support image, $M_S^{i,k}$ is the mask of the target in that image, and $\oplus$ is the embedding operation: $I_S$ and $M_S$ are embedded to generate a four-channel image to be input to the neural network;

the support data set S is composed of $\{(I_S^{i,k}, M_S^{i,k})\}$, wherein i is the index corresponding to the category, k is the index corresponding to the target within the category, $I_S^{i,k}$ is a three-channel image of size W × H, and $M_S^{i,k}$ is the corresponding single-channel mask map of size W × H; W and H are preferably 416.
4. The small sample image target detection method based on class margin equilibrium according to claim 2, wherein in step 1-3, the dimension of the prototype vector is reduced by a fully connected convolution layer, and the loss value corresponding to the reduced prototype vector is obtained through the max-margin loss.
5. The small sample image target detection method based on class margin equilibrium according to claim 4, wherein the loss value corresponding to the reduced prototype vector is obtained by the following formula:
$\mathcal{L}_{margin} = \sum_i d_i^{intra} / d_i^{inter}$

wherein $\mathcal{L}_{margin}$ is the max-margin loss; $d_i^{intra}$ is the intra-class distance, $d_i^{intra} = \frac{1}{K}\sum_{j=1}^{K}\|\tilde{\mu}_i^j - \mu_i'\|_2$; $d_i^{inter}$ is the inter-class distance, $d_i^{inter} = \min_{l \neq i}\|\mu_i' - \mu_l'\|_2$; $\mu_i'$ is the class-mean prototype vector, $\mu_i' = \frac{1}{K}\sum_{j=1}^{K}\tilde{\mu}_i^j$; K is the number of prototype vectors corresponding to a class, and j is the index of a prototype vector within the class.
6. The small sample image target detection method based on class margin equilibrium according to claim 1, wherein step 2 comprises the following substeps:
step 2-1, combining the novel class data and the base class data to form a new data set, then training the network model by repeating steps 1-1 to 1-5;
step 2-2, obtaining new training data;
and step 2-3, training the network model with the new training data and repeating steps 1-1 to 1-5.
7. The small sample image target detection method based on class margin equilibrium according to claim 6, wherein step 2-2 comprises the following substeps:
step 2-2-1, obtaining the gradient map corresponding to the annotated support image by back-propagation through the neural network;
and step 2-2-2, performing binarization processing on the gradient map and multiplying it by the mask map corresponding to the target in the support image to obtain new training data.
8. The small sample image target detection method based on class margin equilibrium according to claim 1, wherein the testing stage applies the trained network model to a small sample target detection task on the novel classes to verify the validity of the model;
preferably, the data categories used in the testing phase are the same as those used in the training phase, and the pictures are different.
9. A computer-readable storage medium, in which a small sample image target detection training program based on class margin equilibrium is stored, which, when executed by a processor, causes the processor to perform the steps of the small sample image target detection method based on class margin equilibrium according to one of claims 1 to 8.
10. A computer device comprising a memory and a processor, wherein the memory stores a small sample image target detection training program based on class margin equilibrium which, when executed by the processor, causes the processor to perform the steps of the small sample image target detection method based on class margin equilibrium according to one of claims 1 to 8.
CN202110262177.9A 2021-03-10 2021-03-10 Small sample image target detection method based on class interval balance Pending CN113159116A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110262177.9A CN113159116A (en) 2021-03-10 2021-03-10 Small sample image target detection method based on class interval balance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110262177.9A CN113159116A (en) 2021-03-10 2021-03-10 Small sample image target detection method based on class interval balance

Publications (1)

Publication Number Publication Date
CN113159116A true CN113159116A (en) 2021-07-23

Family

ID=76886731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110262177.9A Pending CN113159116A (en) 2021-03-10 2021-03-10 Small sample image target detection method based on class interval balance

Country Status (1)

Country Link
CN (1) CN113159116A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113963337A (en) * 2021-12-22 2022-01-21 中国科学院自动化研究所 Object image contour primitive extraction method and device
CN114219804A (en) * 2022-02-22 2022-03-22 汉斯夫(杭州)医学科技有限公司 Small sample tooth detection method based on prototype segmentation network and storage medium
CN115100532A (en) * 2022-08-02 2022-09-23 北京卫星信息工程研究所 Small sample remote sensing image target detection method and system
CN115424053A (en) * 2022-07-25 2022-12-02 北京邮电大学 Small sample image identification method, device and equipment and storage medium
WO2023053569A1 (en) * 2021-09-28 2023-04-06 株式会社Jvcケンウッド Machine learning device, machine learning method, and machine learning program
CN117152596A (en) * 2023-08-30 2023-12-01 广东皮阿诺科学艺术家居股份有限公司 Intelligent verification method for number and type of custom furniture hardware fitting bags

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766769A (en) * 2018-12-18 2019-05-17 四川大学 A kind of road target detection recognition method based on monocular vision and deep learning
CN111583284A (en) * 2020-04-22 2020-08-25 中国科学院大学 Small sample image semantic segmentation method based on hybrid model
CN112001428A (en) * 2020-08-05 2020-11-27 中国科学院大学 Anchor frame-free target detection network training method based on feature matching optimization
CN112329827A (en) * 2020-10-26 2021-02-05 同济大学 Increment small sample target detection method based on meta-learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766769A (en) * 2018-12-18 2019-05-17 四川大学 A kind of road target detection recognition method based on monocular vision and deep learning
CN111583284A (en) * 2020-04-22 2020-08-25 中国科学院大学 Small sample image semantic segmentation method based on hybrid model
CN112001428A (en) * 2020-08-05 2020-11-27 中国科学院大学 Anchor frame-free target detection network training method based on feature matching optimization
CN112329827A (en) * 2020-10-26 2021-02-05 同济大学 Increment small sample target detection method based on meta-learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BOHAO LI et al.: "Beyond Max-Margin: Class Margin Equilibrium for Few-shot Object Detection", arXiv:2103.04612v1 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023053569A1 (en) * 2021-09-28 2023-04-06 株式会社Jvcケンウッド Machine learning device, machine learning method, and machine learning program
CN113963337A (en) * 2021-12-22 2022-01-21 中国科学院自动化研究所 Object image contour primitive extraction method and device
CN113963337B (en) * 2021-12-22 2022-04-08 中国科学院自动化研究所 Object image contour primitive extraction method and device
CN114219804A (en) * 2022-02-22 2022-03-22 汉斯夫(杭州)医学科技有限公司 Small sample tooth detection method based on prototype segmentation network and storage medium
CN115424053A (en) * 2022-07-25 2022-12-02 北京邮电大学 Small sample image identification method, device and equipment and storage medium
CN115100532A (en) * 2022-08-02 2022-09-23 北京卫星信息工程研究所 Small sample remote sensing image target detection method and system
CN117152596A (en) * 2023-08-30 2023-12-01 广东皮阿诺科学艺术家居股份有限公司 Intelligent verification method for number and type of custom furniture hardware fitting bags
CN117152596B (en) * 2023-08-30 2024-04-19 广东皮阿诺科学艺术家居股份有限公司 Intelligent verification method for number and type of custom furniture hardware fitting bags

Similar Documents

Publication Publication Date Title
CN113159116A (en) Small sample image target detection method based on class interval balance
Gu et al. A review on 2D instance segmentation based on deep neural networks
CN109446889B (en) Object tracking method and device based on twin matching network
Zhong et al. Squeeze-and-excitation wide residual networks in image classification
CN109117781B (en) Multi-attribute identification model establishing method and device and multi-attribute identification method
Liu et al. TSingNet: Scale-aware and context-rich feature learning for traffic sign detection and recognition in the wild
CN113378710B (en) Layout analysis method and device for image file, computer equipment and storage medium
CN111625667A (en) Three-dimensional model cross-domain retrieval method and system based on complex background image
CN112967341B (en) Indoor visual positioning method, system, equipment and storage medium based on live-action image
US20190034704A1 (en) Method and apparatus for face classification
CN110765882B (en) Video tag determination method, device, server and storage medium
CN110399895A (en) The method and apparatus of image recognition
CN109241816B (en) Image re-identification system based on label optimization and loss function determination method
CN108038515A (en) Unsupervised multi-target detection tracking and its storage device and camera device
CN113657087B (en) Information matching method and device
Vo et al. Active learning strategies for weakly-supervised object detection
Zhang et al. Learning to detect salient object with multi-source weak supervision
CN112818904A (en) Crowd density estimation method and device based on attention mechanism
Etezadifar et al. Scalable video summarization via sparse dictionary learning and selection simultaneously
Yuan et al. Few-shot scene classification with multi-attention deepemd network in remote sensing
Alsanad et al. Real-time fuel truck detection algorithm based on deep convolutional neural network
CN115393666A (en) Small sample expansion method and system based on prototype completion in image classification
Jin et al. The Open Brands Dataset: Unified brand detection and recognition at scale
CN111144220B (en) Personnel detection method, device, equipment and medium suitable for big data
CN116883740A (en) Similar picture identification method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210723