CN111723852A - Robust training method for target detection network - Google Patents

Robust training method for target detection network

Info

Publication number
CN111723852A
Authority
CN
China
Prior art keywords
label
mining
network
training
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010480420.XA
Other languages
Chinese (zh)
Other versions
CN111723852B (en)
Inventor
李涵生
韩鑫
亢宇鑫
崔磊
杨林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Diyingjia Technology Co ltd
Original Assignee
Hangzhou Diyingjia Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Diyingjia Technology Co ltd filed Critical Hangzhou Diyingjia Technology Co ltd
Priority to CN202010480420.XA priority Critical patent/CN111723852B/en
Publication of CN111723852A publication Critical patent/CN111723852A/en
Application granted granted Critical
Publication of CN111723852B publication Critical patent/CN111723852B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Abstract

The invention relates to a robust training method for a target detection network, comprising the following steps: acquiring a training sample, wherein only part of the detection targets on the training sample carry manual annotation boxes; performing feature extraction on the training sample with the target detection network and generating proposal boxes on the training sample; assigning original sampling labels to the proposal boxes, the original sampling labels comprising positive labels and negative labels; pooling the proposal boxes marked with positive labels with a pooling branch and outputting first region-of-interest features; inputting the first region-of-interest features into a mining network, wherein the mining network is a fully connected neural network that generates new proposal-box labels, namely mining labels; fusing the mining labels with the original sampling labels to generate gold labels; and using the gold labels to train the target detection network.

Description

Robust training method for target detection network
Technical Field
The invention relates to the technical field of computer vision and target detection, in particular to a robust training method for a target detection network.
Background
In recent years, object detection frameworks based on convolutional neural networks (CNNs) have become a powerful tool for various computer vision tasks and are widely applied to object localization and object counting. CNN-based detection frameworks have continued to improve, and a number of excellent architectures have been proposed. Among them, region-based detection frameworks that include a region-proposal pre-processing step (e.g., Faster R-CNN, FPN) are widely used because of their more accurate detection performance. At the same time, many approaches keep improving the feature extractor by optimizing its network architecture. However, little work has addressed how to enhance training robustness under non-optimal parameters and how to keep the network trainable under varying label quality.
Disclosure of Invention
The present application is proposed to solve the above technical problem, and provides a robust training method for a target detection network.
According to an aspect of the present application, there is provided a robust training method for a target detection network, including: acquiring a training sample, wherein only part of the detection targets on the training sample carry manual annotation boxes; performing feature extraction on the training sample with the target detection network and generating proposal boxes on the training sample; assigning original sampling labels to the proposal boxes, the original sampling labels comprising positive labels and negative labels; pooling the proposal boxes marked with positive labels with a pooling branch and outputting first region-of-interest features; inputting the first region-of-interest features into a mining network, wherein the mining network is a fully connected neural network that generates new proposal-box labels, namely mining labels; fusing the mining labels with the original sampling labels to generate gold labels; and using the gold labels to train the target detection network.
Compared with the prior art, the robust training method for the target detection network adds proposal-box mining and label fusion to the training process of the target detection network. This effectively overcomes mislabeled proposal boxes and excessive false positives among the samples caused by missing manual annotation boxes or by thresholds (the first threshold and the second threshold) set too high or too low, and improves the anti-interference capability of the network training process.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 is a flow chart of a robust training method for a target detection network of the present invention;
FIG. 2 is a breakdown of the processing stages of FIG. 1;
FIG. 3 shows some examples of positive labels generated when training on the sparse VOC2007 training set;
FIG. 4 is a comparison chart (1) of the results of target detection networks obtained by the ordinary training method and by the training method proposed in the present application on sparse COCO;
FIG. 5 is a comparison chart (2) of the results of target detection networks obtained by the ordinary training method and by the training method proposed in the present application on COCO.
Detailed Description
Hereinafter, example embodiments of the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Summary of the application
Taking Faster R-CNN as an example of a target detection network: during training, Faster R-CNN generates proposal boxes and then computes the intersection-over-union (IoU) between each proposal box and the annotation boxes. If the IoU is larger than a manually set threshold, the proposal box is assigned a category label (positive sample); otherwise it is assigned a background label (negative sample), and the labeled proposals are used as positive and negative samples to train the network. However, if manual annotation boxes are missing from an image, proposal boxes will be assigned erroneous labels. In addition, if the manually set threshold is not optimal, the sampling of positive and negative samples is affected: a threshold set too high loses too many positive samples and reduces the network's ability to recognize targets, while a threshold set too low produces too many false positives among the sampled samples, interfering with the normal training process of the network and degrading the final performance.
Aiming at these technical problems, the invention seeks to improve the training robustness of a pathology image detection network under training data with varying annotation quality and under non-optimal parameters. The core component of the present invention is a neural network named the "mining network". The mining network learns the characteristics of positive samples and mines potential positive samples in the images. The mined positive samples typically include positive samples that were lost due to non-optimal parameters and missing annotations; by merging the mined positive samples with the originally sampled positive samples, positive proposals lost through improper manual parameter settings and missing annotations can be recovered.
Having described the general principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary method
The robust training method for the target detection network, as shown in fig. 1, includes:
s10, obtaining a training sample, wherein a part of detection targets on the training sample carry artificial labeling boxes;
s20, performing feature extraction on the training sample by using a target detection network, and generating a suggestion box on the training sample;
the task of the target detection network is to locate and identify an object from an image, and the image space is an Euclidean space which is not an effective feature separable space, so that a feature extractor is needed to be used for feature combination of field pixels of the image, and features of a larger range and even the whole image are mapped to a high-dimensional separable space. Because the performance of the network is closely related to the separability of the feature space, the backbone network of the target detection network often utilizes a mainstream classification network that has been widely verified to extract and combine features. The classification networks are usually pre-trained on a large-scale public data set, so that the search range of a parameter space of network parameters is effectively limited by a transfer learning mode, and the training difficulty of the network on a new target detection task is further reduced. Therefore, in the invention, the classification model ResNet101 after pre-training is used as a backbone network to execute a feature extraction task.
S30, assigning original sampling labels to the proposal boxes: if the intersection-over-union (IoU) between a proposal box and a manual annotation box is greater than a set first threshold, the proposal box is marked with a positive label; if the IoU between a proposal box and the manual annotation boxes is less than a set second threshold, the proposal box is marked with a negative label; the positive labels and negative labels constitute the original sampling labels;
the proposed boxes that are neither positive nor negative do not help the network training, and therefore the number of positive labels is crucial to the training of the detector.
S40, using two pooling branches to separately pool the proposal boxes marked with positive labels, and outputting a first region-of-interest feature and a second region-of-interest feature;
the feature map region corresponding to the positive label is also called a region of interest (RoI), and two parallel pooling branches are used to pool the feature map region corresponding to the RoI, and output a RoI feature (i.e., a first RoI feature) and a RoI feature (i.e., a second RoI feature) for mining respectively. The two parallel different branch structures of the RoI pooling ensure that the mining process does not interfere with the training process of the detector. And inputting the RoI characteristics (namely the second region of interest characteristics) into the target detection network, and outputting the result by the target detection network.
S50, inputting the first region-of-interest features into a mining network, wherein the mining network is a fully connected neural network that generates new proposal-box labels, namely mining labels;
the mining network is a fully-connected neural network, the input of which is the RoI characteristic used for mining, and the hidden layer of which can be one or more layers. The mining network outputs a probability distribution (mining score) with the suggested box category activated by softmax, then the mining score is subjected to one-hot coding, and a suggested box mining label represented by m is generated.
This process can be expressed as:
m = onehot[softmax(f_mining(x_roi))],
where m denotes the proposal-box mining label, f_mining is the mining network, and x_roi is the input RoI feature used for mining.
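A sketch of the mining network of S50, under the assumption of a single hidden layer of 1024 units (the patent only states that there may be one or more hidden layers) and of a "classes + background" output; it returns both the softmax mining score and its one-hot encoding m:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiningNetwork(nn.Module):
    """Fully connected mining network: m = onehot[softmax(f_mining(x_roi))]."""

    def __init__(self, roi_feat_dim, num_classes, hidden_dim=1024):
        super().__init__()
        self.fc1 = nn.Linear(roi_feat_dim, hidden_dim)      # hidden layer (assumed size)
        self.fc2 = nn.Linear(hidden_dim, num_classes + 1)   # classes + background (assumed)

    def forward(self, x_roi):
        x = x_roi.flatten(start_dim=1)                           # flatten pooled RoI feature
        score = F.softmax(self.fc2(F.relu(self.fc1(x))), dim=1)  # mining score
        m = F.one_hot(score.argmax(dim=1), num_classes=score.size(1))  # mining label
        return score, m
```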
S60, fusing the mining labels with the original sampling labels to generate gold labels, the gold label serving as the final label of each proposal box;
and (4) the label obtained by fusing the mining label of the suggestion box and the original sampling label is called a gold label, and the gold label is used as a real label for detection training. By generating gold tags through merging operations, it can be ensured that the performance of the probe is not affected even under worst case conditions (excavation network is invalid).
Specifically, the gold label (g) is the union of the original sampling label (a) and the proposal-box mining label (m). Through the merging operation, false negative labels in the original sampling labels (proposals that should be positive but were sampled as negative) are corrected by the proposal-box mining labels. Thus, many positive labels lost due to improper manual thresholds and missing annotations are recovered. The label merging process can be expressed as:
g_k = m_k if a_k = 0, otherwise g_k = a_k,
where a_k = 0 indicates that the proposal box indexed by k is assigned a negative label.
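The merging of S60 reduces to a simple element-wise rule when labels are encoded as integers with 0 for background; the following sketch (an assumed encoding, not the patent's code) adopts the mining label wherever the original sampling label is negative:

```python
import torch

def fuse_gold_labels(a, m):
    """a: (N,) original sampling labels; m: (N,) mining labels (class indices,
    e.g. the argmax of the one-hot mining label). Returns the gold labels g."""
    # Wherever the original sampling label is negative (a_k = 0), adopt the mining
    # label; otherwise keep the original sampling label.
    return torch.where(a == 0, m, a)
```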
S70, using the proposal boxes with their gold labels to train the target detection network. The total loss for network training is given by:
Loss_Total = L_mcls(p, g) + L_loc + L_mining(a, m),
where p is the final softmax-activated probability distribution output by the detector, L_loc is the localization loss, and L_mcls is the cross-entropy loss, which can be expressed as:
L_mcls = -(1/N_cls) Σ_i g_i · log(p_i),
where N_cls is the number of proposal boxes, p_i is the classification probability distribution of the proposal box indexed by i output by the Fast R-CNN branch, and g_i is the gold label of the proposal box indexed by i; the original sampling labels are thus refined through the gold labels.
L_mining denotes the mining loss, i.e., the cross-entropy loss for training the mining network, which can be expressed as:
L_mining = -(1/N_cls) Σ_i a_i · log(m_i),
where a_i denotes the label indexed by i among the sampling (assignment) labels, and m_i is the mining-network output indexed by i. Notably, the labels used to train the mining network are the original sampling labels. Typically, there are hundreds of labeled proposal boxes per training step, and hence hundreds of labels among the sampling labels, which ensures that the mining network can adequately learn the characteristics of positive labels. At this point, Loss_Total can be used to train the entire target detection network. The loss function comprises classification losses and a localization loss; the classification losses comprise the cross-entropy loss L_mcls and the mining loss L_mining, while the localization loss follows the conventional calculation and is not described further here.
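Putting the pieces together, the total loss could be computed as in the sketch below, assuming the classification and mining heads output raw logits so that PyTorch's cross_entropy (which applies log-softmax internally) realizes the two cross-entropy terms, and that the localization loss L_loc is computed by the standard Faster R-CNN box-regression branch and passed in unchanged:

```python
import torch.nn.functional as F

def total_loss(cls_logits, gold_labels, loc_loss, mining_logits, sample_labels):
    """Loss_Total = L_mcls(p, g) + L_loc + L_mining(a, m)."""
    l_mcls = F.cross_entropy(cls_logits, gold_labels)         # detector classification loss
    l_mining = F.cross_entropy(mining_logits, sample_labels)  # mining-network loss
    return l_mcls + loc_loss + l_mining
```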
As shown in fig. 2, the general R-CNN training process is indicated by the dotted line in the figure: proposal boxes are obtained by further refining the positions of default proposal regions (e.g., the "anchors" in Faster R-CNN), and each proposal box is then assigned a category label (or background label) and used as a training sample for the detector. The present application adds proposal-box mining and label fusion to this training process, as indicated by the dash-dot line in fig. 2, which effectively overcomes mislabeled proposal boxes and excessive false positives among the samples caused by missing manual annotation boxes or by thresholds (the first threshold and the second threshold) set too high or too low, and improves the anti-interference capability of the network training process.
To verify the validity of this patent, experiments were performed on the PASCAL VOC 2007 and MS COCO 2017 datasets. PASCAL VOC 2007 consists of 5k training images and 5k test images covering approximately 20 object classes. The COCO dataset contains about 118k training images and 5k validation images, and the validation set was used for testing. The sparse datasets were created manually by randomly deleting annotations until only one annotation per class remained in each training image, as shown in fig. 3(a) (sparse annotation). The sparsification is performed only on the PASCAL training set and the COCO training set; the PASCAL test set and the COCO validation set remain intact.
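The sparsification procedure can be sketched as follows (an illustrative reimplementation with an assumed annotation format, not the authors' script): for each training image, all but one randomly chosen annotation per class is deleted.

```python
import random
from collections import defaultdict

def sparsify(annotations):
    """annotations: list of (class_name, box) pairs for one training image.
    Returns a sparse annotation list keeping one randomly chosen box per class."""
    by_class = defaultdict(list)
    for cls, box in annotations:
        by_class[cls].append(box)
    return [(cls, random.choice(boxes)) for cls, boxes in by_class.items()]
```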
1. Experimental parameters and details:
In the experiments, Faster R-CNN is adopted as the target detection network, the feature extractor is ResNet101, and ResNet101 is pre-trained on ImageNet. The number of training steps is 150k on PASCAL and 1500k on COCO, and the batch size is 1. The learning rate is initially set to 0.0001 and divided by 10 at 60k/600k steps (PASCAL/COCO) and again at 80k/800k steps. Images are rescaled during training so that the short side is 600 pixels and the long side is at most 1000 pixels. In addition, images are randomly flipped horizontally to augment the training data. A proposal box whose IoU with an annotation is higher than 0.5 is assigned a positive label; otherwise, it is assigned a negative label.
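For reference, the schedule and augmentation described above map onto standard PyTorch utilities roughly as in the sketch below; the choice of SGD and of a MultiStepLR scheduler is an assumption, since the patent does not name the optimizer.

```python
import torch
import torchvision.transforms as T

# Random horizontal flip for training-data augmentation; resizing to a 600-pixel
# short side (long side capped at 1000 pixels) is handled by the detector's own
# preprocessing and is omitted here.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])

params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for the detector's parameters
optimizer = torch.optim.SGD(params, lr=0.0001)  # initial learning rate from the text
# Divide the learning rate by 10 at 60k and 80k steps for PASCAL (600k/800k for COCO).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[60000, 80000], gamma=0.1)
```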
2. Quantitative results:
TABLE 1. Mean average precision (mAP) and average recall (AR) of Faster R-CNN trained on the PASCAL training set and evaluated on the PASCAL test set
Data        This patent    mAP     AR
Sparse      ×              58.5    73.4
Sparse      √              61.5    75.5
Complete    ×              68.5    82.7
Complete    √              68.9    83.4
Table 1 lists the results evaluated on the PASCAL 2007 test set. Under training on sparse PASCAL, the method of this patent improves the mAP (mean average precision) by 3.0% and the AR by 2.1%. Meanwhile, the method achieves a 0.7% AR (average recall) improvement on the original PASCAL.
TABLE 2. Average precision (AP) results of Faster R-CNN trained on the MS COCO training set and evaluated on the MS COCO validation set
Data        This patent    AP@0.5    AP      AP-s    AP-m    AP-l
Sparse      ×              25.4      14.9    2.1     13.2    27.7
Sparse      √              28.4      16.5    2.7     15.7    30.3
Complete    ×              34.0      19.6    4.5     20.7    33.0
Complete    √              36.5      20.6    5.3     22.2    34.0
Table 2 shows the results evaluated on the COCO validation set. The method of the invention increases the AP by 1.6% and 1.0% when trained on sparse COCO and on complete COCO, respectively. In addition, the method improves AP@0.5 by 3.0% and 2.5% on sparse and complete COCO, respectively, where AP@0.5 denotes the result under the single IoU threshold of 0.5. AP-s, AP-m, and AP-l are the AP values for small, medium, and large targets, respectively.
3. Robustness analysis:
TABLE 3. Average recall (AR) results of Faster R-CNN trained on the MS COCO training set and evaluated on the MS COCO validation set
Data        This patent    AR      AR-s    AR-m    AR-l
Sparse      ×              17.4    2.1     14.9    32.7
Sparse      √              19.7    2.8     18.0    37.0
Complete    ×              23.5    4.9     24.4    40.4
Complete    √              25.7    6.0     27.4    43.4
In Table 3, the AR results of the present invention (19.7 and 25.7) improve upon the original Faster R-CNN (17.4 and 23.5) by 2.3 and 2.2 AR, respectively. In this section, the training behavior of the target detection network and the effectiveness of the present invention under different IoU thresholds are further explored. The number of positive proposal boxes at different IoU thresholds during the last training period on the PASCAL training set is counted, and the average number of positive proposal boxes per image is reported. At the same time, the mAP results of the networks trained on the PASCAL training set and evaluated on the test set are given.
TABLE 4. Average number of positive proposal boxes in the last training period (at different IoU thresholds), and mAP results evaluated on the PASCAL test set
As shown in Table 4, the mAP results of the method of the present invention outperform Faster R-CNN except when the IoU threshold is 0.3. Moreover, as the IoU threshold increases, the method of the invention achieves a more significant mAP improvement; for example, when the IoU threshold is 0.6, 0.7, and 0.8, the mAP improvements of the method are 1.0%, 2.7%, and 6.8%, respectively.
4. Qualitative results
Figs. 4 and 5 illustrate some detection results produced by the method of this patent compared with Faster R-CNN. Faster R-CNN trained on the sparse COCO dataset tends to miss some objects (red dashed boxes), an error that the method of this patent largely avoids. Meanwhile, the method of this patent obtains more accurate predictions on the COCO dataset.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein. The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (4)

1. A robust training method for a target detection network, characterized by comprising the following steps:
acquiring a training sample, wherein only part of the detection targets on the training sample carry manual annotation boxes;
performing feature extraction on the training sample with the target detection network, and generating proposal boxes on the training sample;
assigning original sampling labels to the proposal boxes, wherein the original sampling labels comprise positive labels and negative labels;
pooling the proposal boxes marked with positive labels with a pooling branch, and outputting first region-of-interest features;
inputting the first region-of-interest features into a mining network, wherein the mining network is a fully connected neural network that generates new proposal-box labels, namely mining labels;
fusing the mining labels with the original sampling labels to generate gold labels;
and using the gold labels to train the target detection network.
2. The robust training method for a target detection network according to claim 1, wherein the generation of the mining label comprises: inputting the first region-of-interest feature into the mining network, the mining network outputting a softmax-activated probability distribution over proposal-box categories, and one-hot encoding the probability distribution to generate the mining label, which is specifically expressed as:
m = onehot[softmax(f_mining(x_roi))],
where m denotes the mining label, f_mining denotes the mining network, and x_roi denotes the first region-of-interest feature.
3. The robust training method for a target detection network according to claim 1, wherein the gold label is the union of the original sampling labels and the mining label, and a false negative label among the original sampling labels (a proposal that should be positive but was marked as negative) is corrected by the mining label and restored to a positive label through the merging operation;
the label merging process is expressed as:
g_k = m_k if a_k = 0, otherwise g_k = a_k,
where a_k = 0 denotes that the proposal box indexed by k is marked with a negative label, g denotes the gold label, a denotes the original sampling label, and m denotes the mining label.
4. The robust training method for a target detection network according to claim 2, wherein the loss function for training the target detection network is:
Loss_Total = L_mcls(p, g) + L_loc + L_mining(a, m),
where p is the softmax-activated probability distribution over proposal-box categories, and g denotes the gold label;
L_mcls denotes the cross-entropy loss,
L_mcls = -(1/N_cls) Σ_i g_i · log(p_i),
where N_cls denotes the number of proposal boxes, p_i denotes the softmax-activated probability distribution over proposal-box categories indexed by i, and g_i denotes the gold label indexed by i;
L_loc denotes the localization loss;
L_mining denotes the mining loss,
L_mining = -(1/N_cls) Σ_i a_i · log(m_i).
CN202010480420.XA 2020-05-30 2020-05-30 Robust training method for target detection network Active CN111723852B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010480420.XA CN111723852B (en) 2020-05-30 2020-05-30 Robust training method for target detection network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010480420.XA CN111723852B (en) 2020-05-30 2020-05-30 Robust training method for target detection network

Publications (2)

Publication Number Publication Date
CN111723852A true CN111723852A (en) 2020-09-29
CN111723852B CN111723852B (en) 2022-07-22

Family

ID=72565402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010480420.XA Active CN111723852B (en) 2020-05-30 2020-05-30 Robust training method for target detection network

Country Status (1)

Country Link
CN (1) CN111723852B (en)


Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180089505A1 (en) * 2016-09-23 2018-03-29 Samsung Electronics Co., Ltd. System and method for deep network fusion for fast and robust object detection
CN106599939A (en) * 2016-12-30 2017-04-26 深圳市唯特视科技有限公司 Real-time target detection method based on region convolutional neural network
US20190102646A1 (en) * 2017-10-02 2019-04-04 Xnor.ai Inc. Image based object detection
CN108197687A (en) * 2017-12-27 2018-06-22 江苏集萃智能制造技术研究所有限公司 A kind of webpage two-dimensional code generation method
CN108416287A (en) * 2018-03-04 2018-08-17 南京理工大学 A kind of pedestrian detection method excavated based on omission negative sample
US20190294177A1 (en) * 2018-03-20 2019-09-26 Phantom AI, Inc. Data augmentation using computer simulated objects for autonomous control systems
CN108875819A (en) * 2018-06-08 2018-11-23 浙江大学 A kind of object and component associated detecting method based on shot and long term memory network
WO2019238976A1 (en) * 2018-06-15 2019-12-19 Université de Liège Image classification using neural networks
CN108960143A (en) * 2018-07-04 2018-12-07 北京航空航天大学 Detect deep learning method in a kind of naval vessel in High Resolution Visible Light remote sensing images
CN109285139A (en) * 2018-07-23 2019-01-29 同济大学 A kind of x-ray imaging weld inspection method based on deep learning
US20200134454A1 (en) * 2018-10-30 2020-04-30 Samsung Sds Co., Ltd. Apparatus and method for training deep learning model
CN109800778A (en) * 2018-12-03 2019-05-24 浙江工业大学 A kind of Faster RCNN object detection method for dividing sample to excavate based on hardly possible
CN110610210A (en) * 2019-09-18 2019-12-24 电子科技大学 Multi-target detection method
CN110716792A (en) * 2019-09-19 2020-01-21 华中科技大学 Target detector and construction method and application thereof
CN111091105A (en) * 2019-12-23 2020-05-01 郑州轻工业大学 Remote sensing image target detection method based on new frame regression loss function

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BIN LIU et al.: "Study of object detection based on Faster R-CNN", 2017 Chinese Automation Congress (CAC) *
唐博恒: "Grasp point recognition for irregular 3D objects based on improved Mask RCNN", China Excellent Master's and Doctoral Dissertations Full-text Database (Master), Information Science and Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221970A (en) * 2021-04-25 2021-08-06 武汉工程大学 Deep convolutional neural network-based improved multi-label semantic segmentation method
CN114612717A (en) * 2022-03-09 2022-06-10 四川大学华西医院 AI model training label generation method, training method, use method and device
CN117572531A (en) * 2024-01-16 2024-02-20 电子科技大学 Intelligent detector embedding quality testing method and system
CN117572531B (en) * 2024-01-16 2024-03-26 电子科技大学 Intelligent detector embedding quality testing method and system

Also Published As

Publication number Publication date
CN111723852B (en) 2022-07-22

Similar Documents

Publication Publication Date Title
CN110443818B (en) Graffiti-based weak supervision semantic segmentation method and system
CN111723852B (en) Robust training method for target detection network
JP5176763B2 (en) Low quality character identification method and apparatus
CN109977895B (en) Wild animal video target detection method based on multi-feature map fusion
CN111612008A (en) Image segmentation method based on convolution network
CN111950610B (en) Weak and small human body target detection method based on precise scale matching
CN112017192B (en) Glandular cell image segmentation method and glandular cell image segmentation system based on improved U-Net network
CN113420669B (en) Document layout analysis method and system based on multi-scale training and cascade detection
CN112819840B (en) High-precision image instance segmentation method integrating deep learning and traditional processing
CN112861785B (en) Instance segmentation and image restoration-based pedestrian re-identification method with shielding function
CN112347284A (en) Combined trademark image retrieval method
CN114998220A (en) Tongue image detection and positioning method based on improved Tiny-YOLO v4 natural environment
CN107564007B (en) Scene segmentation correction method and system fusing global information
CN114255403A (en) Optical remote sensing image data processing method and system based on deep learning
CN110135428B (en) Image segmentation processing method and device
CN114429649B (en) Target image identification method and device
CN114639122A (en) Attitude correction pedestrian re-recognition method based on convolution generation countermeasure network
CN116258978A (en) Target detection method for weak annotation of remote sensing image in natural protection area
CN115273154A (en) Thermal infrared pedestrian detection method and system based on edge reconstruction and storage medium
CN114037886A (en) Image recognition method and device, electronic equipment and readable storage medium
CN111582057B (en) Face verification method based on local receptive field
CN112991280A (en) Visual detection method and system and electronic equipment
CN112861840A (en) Complex scene character recognition method and system based on multi-feature fusion convolutional network
Lee et al. Enhancement for automatic extraction of RoIs for bone age assessment based on deep neural networks
CN115937161A (en) Adaptive threshold semi-supervised based ore sorting method and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant