CN111767944A - Deep learning-based single-stage detector design method suitable for multi-scale target detection - Google Patents

Deep learning-based single-stage detector design method suitable for multi-scale target detection Download PDF

Info

Publication number
CN111767944A
CN111767944A (application CN202010462591.XA; granted as CN111767944B)
Authority
CN
China
Prior art keywords
anchor
loss
convolution
feature
design method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010462591.XA
Other languages
Chinese (zh)
Other versions
CN111767944B (en
Inventor
赵敏
孙棣华
陈宇浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202010462591.XA priority Critical patent/CN111767944B/en
Publication of CN111767944A publication Critical patent/CN111767944A/en
Application granted granted Critical
Publication of CN111767944B publication Critical patent/CN111767944B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. likelihood ratio
    • G06N3/045 Neural networks — combinations of networks
    • G06N3/048 Neural networks — activation functions
    • G06N3/08 Neural networks — learning methods
    • G06V10/464 Salient features, e.g. SIFT, using a plurality of salient features, e.g. bag-of-words [BoW]
    • Y02T10/40 Engine management systems


Abstract

The invention discloses a deep-learning-based single-stage detector design method suitable for multi-scale target detection. On one hand, it constructs a feature pyramid whose levels carry balanced and sufficient feature information, approached from three angles: semantic information, detail information and receptive field. On the other hand, to improve the detector's recall on extreme-scale targets, the invention abandons manually set Anchor size and aspect-ratio parameters and lets the network learn the sizes and distribution the Anchors require. By redesigning the single-stage detector from the perspectives of feature-pyramid construction and Anchor generation, the invention improves detection accuracy on multi-scale targets while preserving detection speed.

Description

Deep learning-based single-stage detector design method suitable for multi-scale target detection
Technical Field
The invention discloses a design method for a single-stage detector suitable for multi-scale target detection, which can be used effectively in detection scenes, such as vehicle detection, where the target scale range varies widely.
Background
Target detection, one of the most fundamental tasks in computer vision, underpins many downstream tasks such as target tracking, re-identification and instance segmentation. In recent years, with the rapid rise of deep learning, target detection algorithms based on convolutional neural networks have come to dominate the major detection benchmarks thanks to their speed, accuracy and robustness. However, the structural characteristics of convolution mean that a convolutional neural network is neither scale- nor deformation-invariant, so scale variation between object instances remains one of the central difficulties of target detection. Convolutional detectors can be divided into single-stage and multi-stage detectors according to whether region proposals are generated: single-stage detectors discard region-proposal generation and classify and regress Anchors directly, achieving real-time inference speed and therefore wide application in real scenes. Their drawback is limited detection accuracy, especially in scenes with densely distributed multi-scale objects. Improving the accuracy of single-stage detectors while keeping their efficient inference speed is therefore one of the research hotspots in target detection.
Current methods for improving multi-scale detection mainly start from three aspects: multi-scale training, feature-pyramid construction and the feature receptive field. Multi-scale training randomly changes the input resolution of training images after a certain number of iterations, forcing the network to learn target features at various scales. SNIP trains at several fixed scales, back-propagating gradients only for targets of the matching scale and ignoring targets that are too large or too small; at test time it runs detection at all scales but keeps only the results of the matching scale. Multi-scale training demands large video memory and long training time, and multi-scale testing severely reduces inference speed. Constructing a feature pyramid is currently the most widely used approach. SSD builds a feature pyramid from backbone features at different resolutions so that each level detects targets of a corresponding scale; FPN and TDM additionally construct a top-down path to supplement the semantically unbalanced pyramid in the backbone, improving multi-scale detection. The top-down branch, however, supplements the semantic information of shallow features while ignoring the detail information that the top-level features lack. Works such as STDN and PFPNet break out of this mould and, aiming at an information-balanced pyramid, derive the multi-scale feature pyramid from structures such as SPP or DenseNet.
Feature-pyramid methods improve multi-scale detection markedly, but it should be noted that complex pyramids introduce excessive parameters and computation, slowing model inference. The receptive-field approach, as the name suggests, improves small-target detection by enlarging the receptive field of shallow features. Inspired by the receptive-field structure of the human visual system, RFBNet adds dilated (hole) convolution to an Inception-style structure, designing a novel RFB module embedded into the SSD algorithm; however, RFBNet focuses only on shallow features, ignores information supplementation for high-level features, and thus limits the detector's multi-scale performance. Expressways have developed rapidly in China since the 1990s and, by virtue of their inherent characteristics and advantages, occupy an extremely important position in modern transportation. As more and more vehicles travel on expressways, problems follow one after another, first among them traffic congestion. Abnormal events such as traffic accidents and road maintenance make the already limited expressway resources difficult to utilize fully, causing serious congestion and vehicle queuing. Unlike urban roads, expressways carry vehicles at high speeds, so congestion, once it occurs, often has serious consequences, lasts longer, and can cause severe economic losses.
Current queue-length prediction methods improve on queuing theory or the traffic-wave model. Patent CN106887141A, based on queuing theory, sets continuous flow-collection nodes and derives the queue length of a road section from the queue length between nodes, assuming that the vehicle arrival rate obeys a certain distribution. Patent CN106571030A proposes a traffic-wave-model-based queue-length prediction method for the specific scene of a road intersection using multi-source data collected by floating cars; although its requirements on detection-equipment deployment are low, it requires a certain proportion of floating cars on the road, which is difficult to satisfy on an expressway in most cases. Moreover, most existing queue-length prediction methods target simpler, closed road environments such as intersections, while expressways also contain non-closed road scenes such as ramp toll stations, for which related research is lacking.
Therefore, using the multi-source data available on the expressway to effectively analyse and grasp the influence range of abnormal events and the evolution of queue length would help traffic managers formulate reasonable control strategies; improving the control and service level of expressways is thus an urgent need of current intelligent transportation systems and a key, difficult research problem.
Disclosure of Invention
In view of the above, the present invention provides a deep-learning-based design method for a single-stage detector suitable for multi-scale target detection. Addressing the shortcomings of the prior art identified above, the single-stage detector is redesigned from the perspectives of the feature pyramid and the Anchor, improving detection accuracy on multi-scale targets while preserving detection speed. Specifically, the invention constructs a feature pyramid whose levels carry balanced and sufficient feature information, approached from three angles: semantic information, detail information and receptive field. Moreover, to improve recall on extreme-scale targets, the invention abandons manually set Anchor size and aspect-ratio parameters and lets the network learn the sizes and distribution the Anchors require.
The purpose of the invention is realized by the following technical scheme.
A deep-learning-based design method for a single-stage detector suitable for multi-scale target detection comprises the following steps:
step one: perform data enhancement on the image;
step two: acquire a feature f with high semantics, high detail and a large receptive field, in the following four parts:
1) pass the input picture through a backbone network to obtain a 32x-downsampled feature map f_c with sufficient semantic information;
2) in parallel with the backbone, apply 16x pooling downsampling to the input picture and pass it through a shallow network of several convolution modules to obtain a feature map f_d encoding rich detail information;
3) fuse f_c and f_d: upsample f_c to obtain f_c^up, use a 1 × 1 convolution to make the dimensions of f_d and f_c^up exactly consistent, apply a Sigmoid operation to f_c^up, and multiply the result with f_d to obtain f_cd;
4) feed f_cd into the multi-branch dilated-convolution module ASPP to obtain the feature map f;
step three: based on the feature map f obtained in step two, divide the features in the feature pyramid into two classes according to whether their resolution is higher or lower than that of f, and construct a feature pyramid suitable for multi-scale target detection by applying a different processing method to each class;
step four: generate Anchors automatically (Guided Anchoring), in the following three parts:
1) apply a single-channel 1 × 1 convolution to the classification-branch feature map f_cls, followed by a Sigmoid operation, to obtain the probability that an Anchor should be placed at each position;
2) apply a two-channel 1 × 1 convolution to the regression-branch feature map f_reg to predict the width and height of the Anchor placed at each position;
3) convolve the Anchor width-height feature map produced by the regression branch to compute offsets for the convolution sampling points, then apply deformable convolution to f_cls and f_reg respectively to obtain the features used to classify and regress the Anchors.
Step five: design of loss function
The loss function of the entire network is expressed as

Loss = L_cls + L_reg + λ(L_loc + L_shape)

where L_loc denotes the Anchor-location loss, L_shape the loss of the Anchor-shape branch, L_cls the classification loss of the Anchor prediction part, L_reg the regression loss of the Anchor prediction part, and λ a weighting coefficient.
Further, the specific process of step one is as follows:
1) randomly crop a region from the training image, ensuring that the cropped region contains a target;
2) randomly expand the cropped region with zero pixels;
3) scale the expanded picture to the input resolution.
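The three augmentation steps above can be sketched as follows. This is a minimal NumPy illustration under assumed parameters (crop-size range, 1.5x expansion ratio, nearest-neighbour resize); the function and argument names are illustrative, not from the patent, and the target-coverage check described in the detailed embodiment is omitted here.

```python
import numpy as np

def random_crop_expand(img, rng, out_size=512):
    """Sketch of the three augmentation steps: random crop,
    zero-pixel expansion, and rescaling to the input resolution."""
    h, w = img.shape[:2]
    # 1) random crop (the target-coverage IoU check is omitted here)
    cw, ch = rng.integers(w // 2, w + 1), rng.integers(h // 2, h + 1)
    x0, y0 = rng.integers(0, w - cw + 1), rng.integers(0, h - ch + 1)
    crop = img[y0:y0 + ch, x0:x0 + cw]
    # 2) expand: place the crop at a random offset inside a larger zero canvas
    ew, eh = int(cw * 1.5), int(ch * 1.5)
    canvas = np.zeros((eh, ew) + img.shape[2:], dtype=img.dtype)
    px, py = rng.integers(0, ew - cw + 1), rng.integers(0, eh - ch + 1)
    canvas[py:py + ch, px:px + cw] = crop
    # 3) nearest-neighbour resize to the network input resolution
    ys = np.arange(out_size) * eh // out_size
    xs = np.arange(out_size) * ew // out_size
    return canvas[ys][:, xs]
```

Because both the crop size and the expansion offset are random, the same target appears at many effective scales across epochs, which is the stated goal of the augmentation.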
Further, the number of convolution modules in step 2) in the second step is 3.
Further, the specific process of step three is as follows:
1) for features in the pyramid whose resolution is higher than that of the f acquired in step two, enlarge the resolution of f by nearest-neighbour upsampling and then refine the features with a 1 × 1 convolution;
2) for features whose resolution is less than or equal to that of f, obtain them directly from f with 3 × 3 convolutions of the appropriate stride.
Further, each loss in step five is calculated as follows:
1) in the Anchor generation part, considering that positive and negative samples are extremely imbalanced, the Anchor-location loss L_loc is calculated with Focal Loss;
2) in the Anchor generation part, only width and height are considered, and the Anchor-shape branch loss L_shape is calculated with GIoU Loss;
3) the Anchor prediction comprises classification and regression: the classification loss L_cls uses a Softmax-based cross-entropy loss function, and the regression loss L_reg uses the Smooth L1 loss function.
By adopting the above technical scheme, the invention provides the following beneficial effects:
Considering that road detection equipment is sparsely distributed, the invention, on one hand, constructs a feature pyramid whose levels carry balanced and sufficient feature information, approached from three angles: semantic information, detail information and receptive field; on the other hand, to improve recall on extreme-scale targets, it abandons manually set Anchor size and aspect-ratio parameters and lets the network learn the sizes and distribution the Anchors require. By redesigning the single-stage detector from the perspectives of feature-pyramid construction and Anchor generation, the invention improves detection accuracy on multi-scale targets while preserving detection speed.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
The drawings of the present invention are described below.
FIG. 1 is a schematic flow diagram of a single-stage detector suitable for multi-scale object detection.
Fig. 2 is a schematic diagram of a single-stage detector suitable for multi-scale target detection.
Detailed Description
The invention is further illustrated by the following figures and examples.
Example 1
As shown in fig. 1-2, the method for designing a single-stage detector suitable for multi-scale target detection based on deep learning provided in this embodiment includes the following steps:
Step one: perform data enhancement on the image, in the following three parts:
1) first randomly generate the width and height of the cropping region, then randomly generate its top-left corner, thereby obtaining the cropping region. The intersection-over-union IoU between the cropping region and every target box is calculated as:

IoU = area(A ∩ B) / area(A ∪ B)

where area denotes the area of a box, ∩ the intersection of two boxes and ∪ their union. If the minimum IoU over all target boxes is not less than a specified threshold (e.g. 0.5), the overlap between the cropping region and every target is sufficient. If the condition is not met, cropping is retried up to 50 times; if it is still not met, the original image is output directly;
2) randomly generate the width and height of the expanded picture, randomly generate a top-left corner at which to place the cropped region, fill the cropped region into the expanded picture, and fill the remaining area with zero pixels;
3) randomly select one of the following five scaling modes: nearest-neighbour interpolation, bilinear interpolation, pixel-area relation, bicubic interpolation over a 4 × 4 pixel neighbourhood, and Lanczos interpolation over an 8 × 8 pixel neighbourhood, and scale the expanded picture to the specified input resolution. Expanding and scaling after random cropping generates more multi-scale targets, strengthening the model's multi-scale representation. In the test stage, step one performs only the scaling part.
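The IoU test used to validate a crop can be sketched as below, assuming boxes in (x1, y1, x2, y2) form; the helper names are illustrative, not from the patent.

```python
import numpy as np

def iou_matrix(crop_box, gt_boxes):
    """IoU between one crop region and N target boxes, all (x1, y1, x2, y2)."""
    x1 = np.maximum(crop_box[0], gt_boxes[:, 0])
    y1 = np.maximum(crop_box[1], gt_boxes[:, 1])
    x2 = np.minimum(crop_box[2], gt_boxes[:, 2])
    y2 = np.minimum(crop_box[3], gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_c = (crop_box[2] - crop_box[0]) * (crop_box[3] - crop_box[1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_c + area_g - inter)

def crop_is_valid(crop_box, gt_boxes, thr=0.5):
    # accept the crop only if every target overlaps it by at least `thr`
    return iou_matrix(crop_box, np.asarray(gt_boxes, dtype=float)).min() >= thr
```

In the retry loop described above, `crop_is_valid` would be evaluated up to 50 times before falling back to the original image.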
Step two: acquire the feature f with high semantics, high detail and a large receptive field, in the following four parts:
1) taking ResNet-50 as an example, remove the final global average pooling and fully connected layers and use the remainder as the backbone network; after feature extraction through the backbone, the input picture yields a 32x-downsampled feature map f_c with sufficient semantic information;
2) in parallel with the backbone, design a shallow convolutional neural network to supplement detail information. Specifically, combine a convolution, a batch-normalization layer (BN) and an activation function (ReLU) into one convolution module, stack several (e.g. 3) such modules, downsample the input picture 16x with a pooling layer and feed it into the stacked modules, obtaining a feature map f_d encoding rich detail information;
3) since the spatial resolution of f_c is smaller than that of f_d, f_c must be upsampled 2x before fusion: taking every group of 4 channels, the values at the same position are rearranged into a 2 × 2 block, so the upsampled f_c^up doubles the spatial resolution while reducing the channel count to a quarter, using channel information to supplement spatial resolution; a 1 × 1 convolution then makes the channel counts of f_c^up and f_d equal. Finally a Sigmoid operation is applied to f_c^up and the result is multiplied onto f_d, yielding the feature f_cd with sufficient semantic and detail information. The whole process is:

f_cd = Sigmoid(W · f_c^up) ⊗ f_d

where W denotes the weights of the 1 × 1 convolutional layer;
4) feed f_cd into the multi-branch dilated-convolution module ASPP (Atrous Spatial Pyramid Pooling) to further enlarge the receptive field, obtaining a high-quality feature f with sufficient semantic information, rich detail information and an adequate receptive field;
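The channel-to-space upsampling and the Sigmoid-gated fusion of step two can be sketched in NumPy as follows. This is a shape-level illustration only: the 1 × 1 convolution is modelled as a channel-mixing matrix, tensor layouts are assumed to be (C, H, W), and all names are illustrative.

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Sub-pixel upsample: trade channels for spatial resolution.
    x: (C*r*r, H, W) -> (C, H*r, W*r)."""
    c, h, w = x.shape
    x = x.reshape(c // (r * r), r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)            # (C, H, r, W, r)
    return x.reshape(c // (r * r), h * r, w * r)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(fc, fd, w1x1):
    """f_cd = Sigmoid(W * up(f_c)) * f_d, with W a 1x1 conv aligning channels."""
    up = pixel_shuffle(fc)                     # (C/4, 2H, 2W)
    gate = np.einsum('oc,chw->ohw', w1x1, up)  # 1x1 conv as channel mixing
    return sigmoid(gate) * fd
```

The Sigmoid gate lets the semantic map f_c modulate, position by position, how strongly each detail feature in f_d is expressed.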
Step three: construct the feature pyramid suitable for multi-scale target detection, in the following two parts:
1) to obtain the 8x-downsampled map of the pyramid, upsample f 2x by recombining every 4 adjacent channels into a feature with doubled spatial resolution, and use a 1 × 1 convolution to unify the output channel count to 256;
2) obtain the pyramid level with the same resolution as f by applying a 256-channel 3 × 3 convolution to f; for levels with resolution smaller than f, for example the 64x-downsampled map, f must be further downsampled 4x, which is achieved by cascading two stride-2 3 × 3 convolutions, and so on. The pyramid has 5 levels in total, the largest being the 8x-downsampled map and the smallest the 128x-downsampled map.
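The level layout of step three can be sketched as follows. Purely to keep the sketch self-contained, the stride-2 3 × 3 convolution is stood in for by a stride-2 3 × 3 average pool, and the stride-8 level uses nearest-neighbour upsampling; only the stride/shape bookkeeping is faithful to the text, and the names are illustrative.

```python
import numpy as np

def conv3x3_stride2(x):
    """Stand-in for a stride-2 3x3 conv (here a 3x3 mean with stride 2,
    padding 1), used only to show how each extra level halves resolution."""
    c, h, w = x.shape
    p = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c, (h + 1) // 2, (w + 1) // 2))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            out[:, i, j] = p[:, 2 * i:2 * i + 3, 2 * j:2 * j + 3].mean(axis=(1, 2))
    return out

def build_pyramid(f):
    """f is the stride-16 map; the 5 levels have strides 8, 16, 32, 64, 128."""
    levels = {8: f.repeat(2, axis=1).repeat(2, axis=2),  # nearest-neighbour up
              16: f}
    cur = f
    for s in (32, 64, 128):
        cur = conv3x3_stride2(cur)
        levels[s] = cur
    return levels
```

With a 512 × 512 input, the stride-16 map is 32 × 32, so the five levels run from 64 × 64 down to 4 × 4.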
Step four: generate Anchors automatically, in the following three parts:
1) apply a single-channel 1 × 1 convolution and a Sigmoid operation to the classification-branch feature map f_cls to predict, at each position, the probability that an Anchor should be placed there. Training therefore requires an Anchor-location label map derived from the target-box positions; the basic principle is that Anchors should be placed in the central region of a target box, while pixels far from any target box receive no Anchor. The pyramid level responsible for a target of width w and height h is chosen as

level = ⌊log₂(√(w·h) / 32) + 0.5⌋

where the highest-resolution pyramid feature is 8x-downsampled and each point on a feature map is conventionally assigned a reference Anchor size of 4 times the downsampling factor, i.e. 32 for the first level; the logarithm, the addition of 0.5 and the floor round the assignment to the nearest level. Thus a target with area below 32² is detected by the highest-resolution map; a target with area between 32² and 64² by the second-highest, and so on. Then set the hyper-parameters ε₁ = 0.2 (centre-region ratio) and ε₂ = 0.5 (ignored-region ratio). Representing a target box as (x, y, w, h), with (x, y) the centre point and (w, h) the width and height, the centre region CR, ignored region IR and outer region OR are:

CR = (x, y, ε₁w, ε₁h)
IR = (x, y, ε₂w, ε₂h) \ CR
OR = R \ (x, y, ε₂w, ε₂h)

where R denotes the whole feature-map plane and A \ B means region B is subtracted from A. Inside CR the Anchor-location label is 1, i.e. an Anchor is placed; inside IR Anchor placement is ignored, i.e. no gradient is back-propagated; inside OR the label is 0, i.e. no Anchor is placed. Where the CR of one target overlaps another region, CR takes precedence; where IR and OR overlap, IR takes precedence; in short, CR > IR > OR. In addition, to ease gradient conflicts between pyramid levels, the CR region on an adjacent level's feature map is also treated as IR on the current map. At test time, Anchors are placed only at positions whose location score exceeds a specified threshold (e.g. 0.01);
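The CR / IR / OR label map for a single target can be sketched as follows, for one pyramid level and with the box already in feature-map coordinates. Labels follow a common convention (1 = place Anchor, -1 = ignore, 0 = no Anchor); the function name and the cross-level IR handling omitted here are illustrative simplifications.

```python
import numpy as np

def anchor_location_targets(shape, box, eps1=0.2, eps2=0.5):
    """Label map for the anchor-location branch: 1 inside the centre
    region CR, -1 (ignored) in the IR ring, 0 elsewhere (OR).
    `box` = (x, y, w, h) with (x, y) the centre, in feature-map coords."""
    x, y, w, h = box
    labels = np.zeros(shape, dtype=np.int8)

    def region(e):
        x0 = max(int(x - e * w / 2), 0)
        x1 = int(np.ceil(x + e * w / 2))
        y0 = max(int(y - e * h / 2), 0)
        y1 = int(np.ceil(y + e * h / 2))
        return slice(y0, y1), slice(x0, x1)

    labels[region(eps2)] = -1   # IR: do not back-propagate here
    labels[region(eps1)] = 1    # CR overrides IR (priority CR > IR > OR)
    return labels
```

Writing IR first and CR second reproduces the stated priority CR > IR > OR for a single box; with multiple boxes the same ordering would be applied over all IR regions before all CR regions.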
2) apply a two-channel 1 × 1 convolution to the regression-branch feature map f_reg to compute the two Anchor shape parameters as

w = σ·e^{dw},  h = σ·e^{dh}

where dw and dh denote the generated width and height parameters and σ is a scale variable that can be learned or set manually; here, for simplicity, σ is set to 8s, with s the stride of the feature map. The exponential form keeps the Anchor width and height non-negative. As in the previous part, training requires the target matched to each feature point so that this branch can be optimized with a loss function. For an Anchor variable a_wh = (x, y, w, h) with unknown width and height and a target box gt = (x_g, y_g, w_g, h_g), define

vIoU(a_wh, gt) = max over w, h of IoU(a_wh, gt)

i.e. the maximum IoU attainable between the Anchor and the target box over all shapes. Clearly w and h, as two real numbers, cannot be enumerated exhaustively, so they are sampled (e.g. using the 9 Anchor shapes set in RetinaNet), giving each Anchor its matched target box and IoU. Positive-sample Anchors are then obtained in two ways: first, with a positive-sample threshold of 0.5, an Anchor whose IoU exceeds the threshold is a positive sample; second, considering each target independently, the Anchor with the maximum IoU is taken as a positive sample if that IoU exceeds 0.4. Finally, 128 positive samples are randomly drawn and the loss function optimizes the difference between the Anchor shape and the target-box shape;
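The shape decoding above reduces to a two-line function; this sketch assumes σ = 8s as stated in the text, and the function name is illustrative.

```python
import numpy as np

def decode_anchor_shape(dw, dh, stride):
    """Decode the two regressed shape parameters into anchor width/height.
    Per the text, sigma = 8 * stride; the exponential keeps w, h > 0."""
    sigma = 8.0 * stride
    return sigma * np.exp(dw), sigma * np.exp(dh)
```

At dw = dh = 0 the decoded Anchor is exactly the reference size 8s for that level, so the network only learns a log-scale correction around it.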
3) since Anchor shapes differ across positions on the same feature map, the Anchor-shape map from the previous part is convolved to produce sampling-point offsets, and deformable convolution is applied to the classification and regression features respectively. Taking a 3 × 3 deformable convolution as an example, with sampling grid R = {(-1,-1), (-1,0), …, (1,1)}:

y(p₀) = Σ over p_n in R of W(p_n)·x(p₀ + p_n + Δp_n)

where Δp_n denotes the sampling-point offset generated from the Anchor-shape feature map, W the parameters of the deformable convolution, x the original feature, and y the new feature produced by the deformable convolution.
Step five: calculating the loss function of the whole network, which mainly comprises the following four parts:
1) the loss of the Anchor position part is calculated by using a cross entropy loss function:
Lloc=-(ylgp+(1-y)lg(1-p))
where y and p represent the corresponding values in the Anchor position label map and the prediction map, respectively. According to the description in the fourth step, it can be known that the difference between the numbers of positive and negative samples in the Anchor position label graph is very large, so that the loss generated by the positive and negative samples needs to be weighted to relieve that the gradient direction is dominated by the negative sample due to the imbalance problem. P is defined by the formulat
Figure BDA0002511521580000085
Then Lloc=-lgpt. Further, the number and difficulty of positive and negative samples are balanced by the Focal length, as shown in the following formula:
L_loc = -α_t·(1-p_t)^r·lg p_t
α_t = α,     if y = 1
α_t = 1 - α, otherwise
where α balances the unequal numbers of positive and negative samples, and (1-p_t)^r balances the unequal numbers of easy and hard samples, so that the network concentrates on classes that are few in number and hard to learn;
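The Focal Loss formulas above can be sketched element-wise in NumPy (an illustrative helper; the exponent written r in the text is the usual focusing parameter, called `gamma` here):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Per-element focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    a_t = np.where(y == 1, alpha, 1 - alpha)  # class-balance weight
    return -a_t * (1 - p_t) ** gamma * np.log(p_t)
```

With `alpha=1` and `gamma=0` this degenerates to plain cross entropy, and a confidently-correct ("easy") sample contributes far less loss than a hard one.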
2) Anchor-shape branch loss function: following the analysis in step four, 128 positive-sample Anchors are randomly selected; the predicted width and height are combined with the coordinates of the current feature point and mapped back to the original image, giving the concrete position and shape of each Anchor. This part of the loss is computed with an IoU-based method, as follows:
L_shape = 1 - IoU(B, B^gt) + R(B, B^gt)
where R(B, B^gt) is a penalty term between the generated Anchor box B and the target box B^gt. The invention adopts the DIoU loss to compute this regression loss; its penalty term is:
R_DIoU = ρ²(b, b^gt) / c²
where ρ²(b, b^gt) is the squared Euclidean distance between the centre points of the Anchor box B and the target box B^gt, and c is the diagonal length of the smallest rectangle enclosing both B and B^gt. Optimizing the Anchor-shape branch with DIoU yields Anchors that better match the target distribution;
3) for the classification part of Anchor, a Softmax-based cross-entropy loss function is adopted:
L_cls = -lg( e^{x_j} / (Σ_{i=1}^{C} e^{x_i} + ε) )
where x_j is the predicted score for the sample's true class, C is the total number of classes, and ε is a very small number (e.g. 10^-5) that prevents the quotient from collapsing to 0 when the denominator falls below machine precision. In addition, to avoid positive/negative imbalance, the algorithm sorts all negative samples by loss and back-propagates gradient only through the negatives with the largest loss, at most three times the number of positives. The regression loss of the Anchor prediction part is computed with the Smooth L1 loss function, as shown below:
SmoothL1(x) = 0.5·x²,    if |x| < 1
SmoothL1(x) = |x| - 0.5, otherwise
where x is the difference between the encoded offset of the target box relative to the Anchor and the predicted offset; the encoding is consistent with most detection algorithms (e.g. Faster R-CNN, SSD);
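The Smooth L1 loss and the 3:1 hard-negative selection described above can be sketched as (hypothetical helper names):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise (element-wise)."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def hard_negative_select(neg_losses, n_pos, ratio=3):
    """Indices of the `ratio * n_pos` negatives with the largest loss."""
    k = min(len(neg_losses), ratio * n_pos)
    return np.argsort(neg_losses)[::-1][:k]
```

Smooth L1 is quadratic near zero (stable gradients for small errors) and linear elsewhere (robust to outliers), which is why it is the common choice for box regression.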
4) combining the three parts above, the loss function of the whole network is a weighted sum of the loss of the Anchor-generation part and the loss of the network prediction part, as shown below:
Loss=Lcls+Lreg+λ(Lloc+Lshape)
where λ is the weighting coefficient between the two parts of the loss and is typically set to 1.
Finally, the above embodiments only illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions without departing from their spirit and scope, and all such modifications are covered by the protection scope of the present invention.

Claims (5)

1. A design method of a single-stage detector suitable for multi-scale target detection based on deep learning is characterized by comprising the following steps:
the method comprises the following steps: performing data enhancement on the image;
step two: the method for acquiring the feature f with high semantic, high detail and large receptive field comprises the following four parts:
1) passing the input picture through a backbone network with 32× downsampling to obtain a feature map f_c with sufficient semantic information;
2) in parallel with the backbone network, applying 16× pooled downsampling to the input picture and passing it through a shallow network of several convolution modules to obtain a feature map f_d encoding rich detail information;
3) fusing f_c and f_d: upsampling f_c to obtain f_c′, while using a 1 × 1 convolution to make the feature dimensions of f_d and f_c′ fully consistent; applying a Sigmoid operation to f_c′ and multiplying the result with f_d to obtain f_cd;
4) inputting f_cd into a multi-branch atrous-convolution module (ASPP) to obtain the feature map f;
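Sub-step 3) can be sketched at shape level in NumPy (nearest-neighbour 2× upsampling since f_c is at 1/32 scale and f_d at 1/16; the 1 × 1 convolution is a channel mix via `einsum`; all names and shapes are illustrative, and the ASPP module of sub-step 4) is omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nearest_upsample2x(f):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def fuse(f_c, f_d, w1x1):
    """f_cd = sigmoid(1x1-conv(up(f_c))) * f_d  (channel-matching Sigmoid gate)."""
    up = nearest_upsample2x(f_c)                  # (C_c, 2H, 2W)
    proj = np.einsum('oc,chw->ohw', w1x1, up)     # 1x1 conv = per-pixel channel mix
    return sigmoid(proj) * f_d                    # gate the detail features
```

The Sigmoid turns the semantic map into a per-pixel, per-channel attention gate over the detail features, rather than simply adding the two maps.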
step three: based on the feature map f obtained in step two, dividing the features of the feature pyramid into two types, those with resolution higher than f and those with resolution lower than f, and constructing a feature pyramid suitable for multi-scale target detection by processing the two types differently;
step four: automatic generation of Anchors, i.e. Guided Anchoring, comprising the following three parts:
1) applying a single-channel 1 × 1 convolution followed by a Sigmoid operation to the classification branch feature map f_cls to obtain the probability that an Anchor should be placed at each position;
2) applying a two-channel 1 × 1 convolution to the regression branch feature map f_reg to compute the two parameters, width and height, of the Anchor placed at each position;
3) computing convolution sampling-point offsets from the Anchor width-height feature map produced by the regression branch, and applying deformable convolution to f_cls and f_reg respectively to obtain the features used to classify and regress the Anchor;
step five: design of loss function
The loss function of the entire network is expressed as
Loss=Lcls+Lreg+λ(Lloc+Lshape)
in the formula, L_loc denotes the Anchor position loss, L_shape the Anchor-shape branch loss, L_cls the classification loss of the Anchor prediction part, L_reg the regression loss of the Anchor prediction part, and λ a weighting coefficient.
2. The design method according to claim 1, wherein the specific process of the first step is as follows:
1) randomly cropping an area from the training image while ensuring that the cropped area contains a target;
2) randomly expanding the cropped area with zero pixels;
3) scaling the expanded picture to the input resolution.
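The three augmentation steps of claim 2 can be sketched as a toy single-channel version (the target-preserving check of step 1) is noted but not implemented, and all sizes and names are illustrative):

```python
import numpy as np

def augment(img, rng, out_size=300):
    """Random crop -> zero-pad expand -> resize, per the three steps above."""
    H, W = img.shape[:2]
    # 1) random crop (a real implementation would also keep at least one target)
    y0 = rng.integers(0, H // 4 + 1); x0 = rng.integers(0, W // 4 + 1)
    crop = img[y0: y0 + 3 * H // 4, x0: x0 + 3 * W // 4]
    # 2) expand onto a larger canvas of zero pixels at a random position
    ch, cw = crop.shape[:2]
    canvas = np.zeros((ch * 2, cw * 2) + img.shape[2:], dtype=img.dtype)
    oy = rng.integers(0, ch + 1); ox = rng.integers(0, cw + 1)
    canvas[oy: oy + ch, ox: ox + cw] = crop
    # 3) nearest-neighbour resize to the network input resolution
    ys = (np.arange(out_size) * canvas.shape[0] / out_size).astype(int)
    xs = (np.arange(out_size) * canvas.shape[1] / out_size).astype(int)
    return canvas[ys][:, xs]
```

Cropping enlarges objects while zero-pad expansion shrinks them, so together the two steps expose the detector to a wider range of object scales.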
3. The design method according to claim 1 or 2, wherein the number of convolution modules in step 2) in step two is 3.
4. The design method according to claim 1, wherein the specific process of step three is as follows:
1) for features of the pyramid with resolution higher than the f acquired in step two, enlarging the resolution of f with nearest-neighbour upsampling and then refining the features with a 1 × 1 convolution;
2) for features of the pyramid with resolution less than or equal to f, obtaining them directly from f by 3 × 3 convolution with a specified stride.
5. The design method according to claim 1, wherein the calculation method of each loss in the step five is as follows:
1) in the Anchor generation part, considering the extreme imbalance between positive and negative samples, computing the Anchor position loss L_loc with Focal Loss;
2) in the Anchor generation part, considering only width and height, computing the Anchor-shape branch loss L_shape with GIoU Loss;
3) the Anchor prediction comprises classification and regression; the classification loss L_cls uses a Softmax-based cross-entropy loss function and the regression loss L_reg uses the Smooth L1 loss function.
CN202010462591.XA 2020-05-27 2020-05-27 Single-stage detector design method suitable for multi-scale target detection based on deep learning Active CN111767944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010462591.XA CN111767944B (en) 2020-05-27 2020-05-27 Single-stage detector design method suitable for multi-scale target detection based on deep learning

Publications (2)

Publication Number Publication Date
CN111767944A true CN111767944A (en) 2020-10-13
CN111767944B CN111767944B (en) 2023-08-15

Family

ID=72719742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010462591.XA Active CN111767944B (en) 2020-05-27 2020-05-27 Single-stage detector design method suitable for multi-scale target detection based on deep learning

Country Status (1)

Country Link
CN (1) CN111767944B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417990A (en) * 2020-10-30 2021-02-26 四川天翼网络服务有限公司 Examination student violation behavior identification method and system
CN112529005A (en) * 2020-12-11 2021-03-19 西安电子科技大学 Target detection method based on semantic feature consistency supervision pyramid network
CN112633162A (en) * 2020-12-22 2021-04-09 重庆大学 Rapid pedestrian detection and tracking method suitable for expressway outfield shielding condition
CN112733929A (en) * 2021-01-07 2021-04-30 南京工程学院 Improved method for detecting small target and shielded target of Yolo underwater image
CN113052170A (en) * 2021-03-22 2021-06-29 江苏东大金智信息系统有限公司 Small target license plate recognition method under unconstrained scene
CN113221754A (en) * 2021-05-14 2021-08-06 深圳前海百递网络有限公司 Express waybill image detection method and device, computer equipment and storage medium
CN113780358A (en) * 2021-08-16 2021-12-10 华北电力大学(保定) Real-time hardware fitting detection method based on anchor-free network
CN114189876A (en) * 2021-11-17 2022-03-15 北京航空航天大学 Flow prediction method and device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180165551A1 (en) * 2016-12-08 2018-06-14 Intel Corporation Technologies for improved object detection accuracy with multi-scale representation and training
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN109584227A (en) * 2018-11-27 2019-04-05 山东大学 A kind of quality of welding spot detection method and its realization system based on deep learning algorithm of target detection
CN110807384A (en) * 2019-10-24 2020-02-18 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Small target detection method and system under low visibility
CN111008619A (en) * 2020-01-19 2020-04-14 南京智莲森信息技术有限公司 High-speed rail contact net support number plate detection and identification method based on deep semantic extraction

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BINQI FAN et al.: "FFBNET: Lightweight Backbone for Object Detection Based Feature Fusion Block", 2019 IEEE International Conference on Image Processing (ICIP), pages 3920-3924 *
QIUSHAN GUO et al.: "MSFD: Multi-scale receptive field face detector", 2018 24th International Conference on Pattern Recognition (ICPR), pages 1869-1874 *
QIAO YANJUN: "Geographic Entity Object Recognition and Detection for Outdoor Augmented Reality", China Masters' Theses Full-text Database (Information Science and Technology), no. 08, pages 140-8 *
ZHANG QINGWU et al.: "Pedestrian Detection Method Based on Anchor-free Architecture", Information Technology and Network Security, pages 59-63 *

Also Published As

Publication number Publication date
CN111767944B (en) 2023-08-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant