CN109871903B

CN109871903B - Target detection method based on end-to-end deep network and adversarial learning

Info

Publication number: CN109871903B
Application number: CN201910179602.0A
Authority: CN
Inventors: 韩光; 周旺; 杨超
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2019-03-11
Filing date: 2019-03-11
Publication date: 2022-08-26
Anticipated expiration: 2039-03-11
Also published as: CN109871903A

Abstract

A target detection method based on an end-to-end deep network and adversarial learning. Building on SSD, the method exploits the small local receptive field of shallow convolutional layers and fuses low-resolution feature maps carrying high-level semantic information with high-resolution feature maps carrying low-level semantic information through a deconvolution structure, thereby improving the average precision of the detection algorithm. In addition, coarse-grained candidate box information is obtained through an RPN: a binary classification decision is added after candidate boxes are generated on the base feature layers, and the boxes are then further refined by a conventional regression branch to obtain more accurate detection boxes. Finally, to address the poor performance of the SSD algorithm on partially occluded targets, the method occludes part of the features by applying an occlusion mask to the feature map, which achieves the effect of adversarial learning.

Description

Target detection method based on end-to-end deep network and adversarial learning

Technical Field

The invention relates to the technical field of target detection, and in particular to a target detection method based on an end-to-end deep network and adversarial learning.

Background

With the continuing development of computer technology and the growing demand for intelligent video analysis, target detection has become one of the important and challenging research directions in computer vision. Target detection is a prerequisite for many higher-level vision tasks, including activity or event recognition and scene content understanding. It is also applied in many practical tasks, such as intelligent video surveillance, content-based image retrieval, robot navigation, and augmented reality. Target detection is therefore of great significance both to the field of computer vision and to practical applications.

At present, mainstream target detection algorithms are mostly based on deep learning models and fall into two categories: (1) two-stage detection algorithms, which split detection into two stages, first generating candidate regions and then classifying them; typical representatives are the region-proposal-based R-CNN family, such as R-CNN and Fast R-CNN; (2) one-stage detection algorithms, which need no region proposal stage and directly produce the class probabilities and position coordinates of objects; typical examples are YOLO and SSD.

Although current mainstream target detection techniques are accurate for large and medium-sized targets, they perform poorly on small targets and on targets that are partially occluded.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a target detection method based on an end-to-end deep network and adversarial learning that addresses the above deficiencies in the prior art. Building on SSD, the method exploits the small local receptive field of shallow convolutional layers and fuses low-resolution feature maps carrying high-level semantic information with high-resolution feature maps carrying low-level semantic information through a deconvolution structure, thereby improving the average precision of the detection algorithm. In addition, coarse-grained candidate box information is obtained through an RPN: a binary classification decision is added after candidate boxes are generated on the base feature layers, and the boxes are then further refined by a conventional regression branch to obtain more accurate detection boxes. Finally, to address the poor performance of the SSD algorithm on partially occluded targets, the method occludes part of the features by applying an occlusion mask to the feature map, which achieves the effect of adversarial learning.

A target detection method based on an end-to-end deep network and adversarial learning comprises the following steps:

Step 1: introducing a deconvolution structure into the SSD algorithm and using deconvolution to fuse the low-resolution, high-semantic feature maps with the high-resolution, low-semantic feature maps, which strengthens the feature extraction capability of the lower layers of the network;

Step 2: obtaining coarse-grained candidate box information through an RPN, adding a binary classification decision after candidate boxes are generated on the base feature layers, and then refining the boxes through a conventional regression branch to obtain more accurate detection box information;

Step 3: matching the candidate boxes at different scales selected in step 2 to the fusion layers produced in step 1 by fusing features of different scales, then enlarging or shrinking every feature map region corresponding to a candidate box to a fixed size through an ROIPooling operation, and occluding part of the features by applying occlusion masks to the feature maps so as to achieve the effect of adversarial learning;

Step 4: passing the feature map occluded by the occlusion mask through two fully connected layers and a Softmax classifier to regress the detection box and predict the class.

Further, step 1 specifically comprises:

First, image data is input into an SSD network to extract image features, and four feature maps of different resolutions are selected from the network structure;

Then, the low-resolution, high-semantic feature maps in the SSD are deconvolved, and each deconvolved feature map is fused with the original feature map of the corresponding layer. Fusion proceeds by applying a deconvolution to the feature map of the deeper layer and passing its high-level information to the preceding layer; this is repeated layer by layer, finally yielding four fusion layers of different resolutions.
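The layer-by-layer fusion described above can be illustrated with a minimal sketch. It rests on assumptions that are not in the patent: nearest-neighbour 2x upsampling stands in for the learned deconvolution, fusion is element-wise addition, each map is exactly half the size of the previous one, and the names `upsample2x`, `fuse_top_down`, and `constant_map` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling (assumed stand-in for a learned deconvolution)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out


def fuse_top_down(pyramid):
    """Fuse feature maps ordered from highest to lowest resolution.

    Assumes each map is half the size of the previous one and that
    fusion is element-wise addition; both are choices of this sketch.
    """
    fused = [None] * len(pyramid)
    fused[-1] = [list(row) for row in pyramid[-1]]  # deepest map is kept unchanged
    for i in range(len(pyramid) - 2, -1, -1):
        up = upsample2x(fused[i + 1])               # bring deeper information up one level
        fused[i] = [[a + b for a, b in zip(ra, rb)]
                    for ra, rb in zip(pyramid[i], up)]
    return fused


def constant_map(size, value):
    """Hypothetical helper: a size x size single-channel map filled with one value."""
    return [[value] * size for _ in range(size)]


pyramid = [constant_map(8, 1.0), constant_map(4, 2.0),
           constant_map(2, 3.0), constant_map(1, 4.0)]
fused = fuse_top_down(pyramid)
```

Every fused map keeps the resolution of its input layer while accumulating information from all deeper layers, which is the property the four fusion layers are meant to have.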

Further, step 2 is specifically as follows:

For the four feature maps of different resolutions produced in step 1, candidate boxes of different sizes are generated on each of the four maps, and some of the negative samples are removed according to the IoU between each candidate box and the ground-truth box. Two branches are finally obtained from the four feature layers: one is the coordinate regression branch of the candidate boxes, and the other is the binary classification branch of the candidate boxes.
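As an illustration of IoU-based sample selection, the sketch below computes the IoU between a candidate box and the ground-truth boxes and sorts candidates into positives and retained negatives. The thresholds `pos_thr` and `neg_thr`, the rule of dropping candidates that fall between them, and the function names are assumptions of this sketch; the patent does not state which negatives are removed or at what thresholds.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def split_candidates(candidates, gt_boxes, pos_thr=0.5, neg_thr=0.3):
    """Split candidates into positives and retained negatives.

    Thresholds are hypothetical. Candidates whose best IoU lies between
    neg_thr and pos_thr are treated as ambiguous and dropped.
    """
    positives, negatives = [], []
    for cand in candidates:
        best = max((iou(cand, gt) for gt in gt_boxes), default=0.0)
        if best >= pos_thr:
            positives.append(cand)
        elif best < neg_thr:
            negatives.append(cand)
    return positives, negatives


gt_boxes = [(0, 0, 10, 10)]
candidates = [(0, 0, 10, 10), (0, 0, 10, 5), (20, 20, 30, 30), (0, 0, 10, 4)]
positives, negatives = split_candidates(candidates, gt_boxes)
```

In this toy input the fourth candidate overlaps the ground truth with an IoU of 0.4, so it is neither a positive nor a retained negative.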

Further, step 3 specifically comprises:

First, the four fusion layers of different resolutions obtained in step 1 are matched to the positive and negative candidate boxes on the feature maps of different resolutions obtained in step 2;

Then, an ROIPooling operation is applied on the four fusion layers of different resolutions, scaling the feature map region corresponding to each candidate box to a uniform size;

A mask is then generated by a fully connected layer to decide which parts of the feature map should be occluded, so that the resulting hard samples are, as far as possible, misclassified by the detector; the mask is adjusted automatically according to the loss function.
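The masking step alone can be sketched as follows. In the patent the mask is produced by a fully connected layer and trained against the detector's loss; here the mask scores are simply hand-written numbers, and the rule of occluding the highest-scoring fraction of cells (`occlude_fraction`) is an assumption of this sketch, as are the function names.

```python
def binarize_mask(scores, occlude_fraction=0.25):
    """Turn per-cell scores into a binary mask: 0 = occluded, 1 = kept.

    Occludes the highest-scoring cells (an assumed rule). Ties at the
    threshold may occlude slightly more than the requested fraction.
    """
    flat = sorted((s for row in scores for s in row), reverse=True)
    k = max(1, int(len(flat) * occlude_fraction))
    threshold = flat[k - 1]
    return [[0 if s >= threshold else 1 for s in row] for row in scores]


def apply_mask(feature, mask):
    """Element-wise product: occluded cells of the feature map become zero."""
    return [[f * m for f, m in zip(f_row, m_row)]
            for f_row, m_row in zip(feature, mask)]


# A pooled 4x4 single-channel feature map and hand-written mask scores
# standing in for the output of the fully connected mask generator.
feature = [[1.0] * 4 for _ in range(4)]
scores = [[0.90, 0.80, 0.10, 0.20],
          [0.70, 0.60, 0.30, 0.00],
          [0.05, 0.15, 0.25, 0.35],
          [0.40, 0.45, 0.50, 0.55]]
mask = binarize_mask(scores)
occluded = apply_mask(feature, mask)
```

The occluded feature map is what the downstream detector would see; in an adversarial setup the scores would be updated so that the occlusion raises the detector's loss.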

Further, in step 4, the feature map occluded by the occlusion mask is passed through two fully connected layers and a Softmax classifier to regress the detection box and predict the class; specifically, the results are screened by non-maximum suppression (NMS) and a confidence threshold to obtain the final predicted class, probability, and localization result.
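The final screening can be sketched as greedy non-maximum suppression preceded by a confidence filter. The thresholds `score_thr` and `iou_thr` are hypothetical values, and the sketch handles a single class; a full detector would typically run it per class.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, score_thr=0.5, iou_thr=0.45):
    """Greedy NMS over (box, score) pairs for one class.

    Drops detections below score_thr, then keeps boxes in descending
    score order, skipping any box that overlaps an already kept box
    by more than iou_thr. Both thresholds are assumed values.
    """
    confident = [d for d in detections if d[1] >= score_thr]
    confident.sort(key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in confident:
        if all(iou(box, kept_box) <= iou_thr for kept_box, _ in kept):
            kept.append((box, score))
    return kept


detections = [((0, 0, 10, 10), 0.9),
              ((1, 1, 11, 11), 0.8),    # heavily overlaps the first box
              ((20, 20, 30, 30), 0.7),
              ((40, 40, 50, 50), 0.2)]  # below the confidence threshold
final = nms(detections)
```

Here the second box is suppressed by the first and the fourth is removed by the confidence filter, leaving two detections.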

In summary, the invention uses deconvolution to fuse the high-resolution, low-semantic feature maps with the low-resolution, high-semantic feature maps, which improves the small-target detection capability of the SSD algorithm, and introduces adversarial learning to strengthen the algorithm's ability to detect partially occluded targets.

Drawings

Fig. 1 is a network structure diagram of the target detection method based on an end-to-end deep network and adversarial learning according to the present invention.

Fig. 2 is a structural diagram of the deconvolution feature fusion proposed by the present invention.

Fig. 3 is an example of features occluded by the binarized mask according to the present invention.

Detailed Description

The technical solution of the invention is explained in further detail below with reference to the drawings.

Step 1: a deconvolution structure is introduced into the SSD algorithm, and deconvolution is used to fuse the low-resolution, high-semantic feature maps with the high-resolution, low-semantic feature maps, which strengthens the feature extraction capability of the lower layers of the network.

Step 1 specifically comprises the following:

First, image data is input into an SSD network to extract image features, and four feature maps of different resolutions are selected from the network structure.

Step 2: coarse-grained candidate box information is obtained through an RPN, a binary classification decision is added after candidate boxes are generated on the base feature layers, and the boxes are then refined through a conventional regression branch to obtain more accurate detection box information.

Step 2 is specifically as follows:

Step 3: the candidate boxes at different scales selected in step 2 are matched to the fusion layers produced in step 1 by fusing features of different scales; every feature map region corresponding to a candidate box is then enlarged or shrunk to a fixed size through an ROIPooling operation, and part of the features is occluded by applying occlusion masks to the feature maps so as to achieve the effect of adversarial learning.

Step 3 specifically comprises the following:

First, the four fusion layers of different resolutions obtained in step 1 are matched to the positive and negative candidate boxes on the feature maps of different resolutions obtained in step 2.

Then, an ROIPooling operation is applied on the four fusion layers of different resolutions, scaling the feature map region corresponding to each candidate box to a uniform size.
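Scaling a candidate region to a uniform size can be sketched with a simple max-pooling version of ROI pooling. The bin-splitting rule (floor for bin start, ceiling for bin end), the single-channel list-of-lists layout, the exclusive `x2`/`y2` convention, and the name `roi_pool` are assumptions of this sketch rather than details given in the text.

```python
def roi_pool(fmap, roi, out_size=2):
    """Max-pool the region roi of a 2-D feature map to out_size x out_size.

    roi is (x1, y1, x2, y2) in feature-map cells with x2 and y2 exclusive,
    and must lie inside the map with at least one cell. Regions smaller
    than out_size are effectively enlarged because bins may share cells.
    """
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_size):
        ys = y1 + (i * h) // out_size                                   # floor
        ye = max(ys + 1, y1 + ((i + 1) * h + out_size - 1) // out_size)  # ceiling
        row = []
        for j in range(out_size):
            xs = x1 + (j * w) // out_size
            xe = max(xs + 1, x1 + ((j + 1) * w + out_size - 1) // out_size)
            row.append(max(fmap[y][x]
                           for y in range(ys, min(ye, y2))
                           for x in range(xs, min(xe, x2))))
        pooled.append(row)
    return pooled


fmap = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
whole = roi_pool(fmap, (0, 0, 4, 4))   # 4x4 region shrunk to 2x2
odd = roi_pool(fmap, (1, 1, 4, 4))     # 3x3 region, bins overlap by one cell
tiny = roi_pool(fmap, (2, 2, 3, 3))    # 1x1 region enlarged to 2x2
```

Whatever the size of the candidate box, the output has the same fixed shape, which is what allows the pooled regions to feed the fully connected layers that follow.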

Step 4: the feature map occluded by the occlusion mask is passed through two fully connected layers and a Softmax classifier to regress the detection box and predict the category; specifically, the results are screened by non-maximum suppression (NMS) and a confidence threshold to obtain the final predicted category, probability, and localization result.

The above is only a preferred embodiment of the present invention, and the scope of the invention is not limited to this embodiment; equivalent modifications or changes made by those skilled in the art on the basis of the present disclosure fall within the scope of protection set forth in the appended claims.

Claims

1. A target detection method based on an end-to-end deep network and adversarial learning, characterized by comprising the following steps:

step 2 is specifically as follows:

for the four feature maps of different resolutions generated in step 1, generating candidate boxes of different sizes on the four feature maps, removing some of the negative samples according to the IoU between the candidate boxes and the ground-truth box, and finally obtaining two branches based on the four feature layers, wherein one branch is the coordinate regression branch of the candidate boxes and the other is the classification branch of the candidate boxes;

step 3 specifically comprises the following:

first, matching the four fusion layers of different resolutions obtained in step 1 to the positive and negative candidate boxes on the feature maps of different resolutions obtained in step 2;

then, generating a mask through a fully connected layer to determine which parts of the feature map should be occluded, so that the resulting hard samples are misclassified by the detector, the mask being adjusted automatically according to the loss function;

2. The method of claim 1, wherein step 1 specifically comprises the following:

3. The method of claim 1, wherein in step 4 the feature map occluded by the occlusion mask is passed through two fully connected layers and a Softmax classifier to regress the detection box and predict the category; specifically, the results are screened by non-maximum suppression and a confidence threshold to obtain the final predicted category, probability, and localization result.