CN111767944A - Deep learning-based single-stage detector design method suitable for multi-scale target detection - Google Patents

Deep learning-based single-stage detector design method suitable for multi-scale target detection Download PDF

Info

Publication number
CN111767944A
CN111767944A (application CN202010462591.XA; granted as CN111767944B)
Authority
CN
China
Prior art keywords
anchor
loss
convolution
feature
design method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010462591.XA
Other languages
Chinese (zh)
Other versions
CN111767944B (en
Inventor
赵敏
孙棣华
陈宇浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202010462591.XA priority Critical patent/CN111767944B/en
Publication of CN111767944A publication Critical patent/CN111767944A/en
Application granted granted Critical
Publication of CN111767944B publication Critical patent/CN111767944B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. likelihood ratio
    • G06N3/045 Neural networks — combinations of networks
    • G06N3/048 Neural networks — activation functions
    • G06N3/08 Neural networks — learning methods
    • G06V10/464 Salient features, e.g. SIFT, using a plurality of salient features, e.g. bag-of-words [BoW]
    • Y02T10/40 Engine management systems


Abstract

The invention discloses a deep-learning-based single-stage detector design method suitable for multi-scale target detection. On one hand, it constructs a feature pyramid whose levels carry balanced and sufficient feature information, approached from three angles: semantic information, detail information and receptive field. On the other hand, to improve the detector's recall on extreme-scale targets, the invention abandons manually set Anchor size and aspect-ratio parameters and lets the network learn the sizes and distribution the Anchors require. By redesigning the single-stage detector from the perspectives of feature-pyramid construction and Anchor generation, the invention improves detection accuracy on multi-scale targets while preserving detection speed.

Description

Deep learning-based single-stage detector design method suitable for multi-scale target detection
Technical Field
The invention discloses a design method for a single-stage detector suitable for multi-scale target detection, which can be used effectively in detection scenes, such as vehicle detection, where the target scale range varies widely.
Background
Target detection, one of the most fundamental tasks in computer vision, underpins many downstream tasks such as target tracking, re-identification and instance segmentation. In recent years, with the rapid rise of deep learning, target detection algorithms based on convolutional neural networks have come to dominate the major detection benchmarks thanks to their speed, accuracy and robustness. However, the structural characteristics of convolution mean that a convolutional neural network is neither scale- nor deformation-invariant, so scale variation between object instances remains one of the central difficulties of target detection. Convolutional detectors can be divided into single-stage and multi-stage detectors according to whether region proposals are generated: single-stage detectors discard region-proposal generation and classify and regress Anchors directly, achieving real-time inference speed and therefore wide application in real scenes. Their drawback is limited detection accuracy, especially in scenes with densely distributed multi-scale objects. Improving the accuracy of single-stage detectors while keeping their efficient inference speed is therefore one of the research hotspots in target detection.
Current methods for improving multi-scale detection mainly start from three aspects: multi-scale training, feature-pyramid construction and the feature receptive field. Multi-scale training randomly changes the input resolution of training images after a certain number of iterations, forcing the network to learn target features at various scales. SNIP trains at several fixed scales, back-propagating gradients only for targets of the matching scale and ignoring targets that are too large or too small; at test time it runs detection at all scales but keeps only the results of the matching scale. Multi-scale training demands large video memory and long training time, and multi-scale testing severely reduces inference speed. Constructing a feature pyramid is currently the most widely used approach. SSD builds a feature pyramid from backbone features at different resolutions so that each level detects targets of a corresponding scale; FPN and TDM additionally construct a top-down path to supplement the semantically unbalanced pyramid in the backbone, improving multi-scale detection. The top-down branch, however, supplements the semantic information of shallow features while ignoring the detail information that the top-level features lack. Works such as STDN and PFPNet break out of this mould and, aiming at an information-balanced pyramid, derive the multi-scale feature pyramid from structures such as SPP or DenseNet.
Feature-pyramid methods improve multi-scale detection markedly, but it should be noted that complex pyramids introduce excessive parameters and computation, slowing model inference. The receptive-field approach, as the name suggests, improves small-target detection by enlarging the receptive field of shallow features. Inspired by the receptive-field structure of the human visual system, RFBNet adds dilated (hole) convolution to an Inception-style structure, designing a novel RFB module embedded into the SSD algorithm; however, RFBNet focuses only on shallow features, ignores information supplementation for high-level features, and thus limits the detector's multi-scale performance. Expressways have developed rapidly in China since the 1990s and, by virtue of their inherent characteristics and advantages, occupy an extremely important position in modern transportation. As more and more vehicles travel on expressways, problems follow one after another, first among them traffic congestion. Abnormal events such as traffic accidents and road maintenance make the already limited expressway resources difficult to utilize fully, causing serious congestion and vehicle queuing. Unlike urban roads, expressways carry vehicles at high speeds, so congestion, once it occurs, often has serious consequences, lasts longer, and can cause severe economic losses.
Current queue-length prediction methods improve on queuing theory or the traffic-wave model. Patent CN106887141A, based on queuing theory, sets continuous flow-collection nodes and derives the queue length of a road section from the queue length between nodes, assuming that the vehicle arrival rate obeys a certain distribution. Patent CN106571030A proposes a traffic-wave-model-based queue-length prediction method for the specific scene of a road intersection using multi-source data collected by floating cars; although its requirements on detection-equipment deployment are low, it requires a certain proportion of floating cars on the road, which is difficult to satisfy on an expressway in most cases. Moreover, most existing queue-length prediction methods target simpler, closed road environments such as intersections, while expressways also contain non-closed road scenes such as ramp toll stations, for which related research is lacking.
Therefore, using the multi-source data available on the expressway to effectively analyse and grasp the influence range of abnormal events and the evolution of queue length would help traffic managers formulate reasonable control strategies; improving the control and service level of expressways is thus an urgent need of current intelligent transportation systems and a key, difficult research problem.
Disclosure of Invention
In view of the above, the present invention provides a deep-learning-based design method for a single-stage detector suitable for multi-scale target detection. Addressing the shortcomings of the prior art identified above, the single-stage detector is redesigned from the perspectives of the feature pyramid and the Anchor, improving detection accuracy on multi-scale targets while preserving detection speed. Specifically, the invention constructs a feature pyramid whose levels carry balanced and sufficient feature information, approached from three angles: semantic information, detail information and receptive field. Moreover, to improve recall on extreme-scale targets, the invention abandons manually set Anchor size and aspect-ratio parameters and lets the network learn the sizes and distribution the Anchors require.
The purpose of the invention is realized by the following technical scheme.
A deep-learning-based design method for a single-stage detector suitable for multi-scale target detection comprises the following steps:
step one: perform data enhancement on the image;
step two: acquire a feature f with high semantics, high detail and a large receptive field, in the following four parts:
1) pass the input picture through a backbone network to obtain a 32x-downsampled feature map f_c with sufficient semantic information;
2) in parallel with the backbone, apply 16x pooling downsampling to the input picture and pass it through a shallow network of several convolution modules to obtain a feature map f_d encoding rich detail information;
3) fuse f_c and f_d: upsample f_c to obtain f_c^up, use a 1 × 1 convolution to make the dimensions of f_d and f_c^up exactly consistent, apply a Sigmoid operation to f_c^up, and multiply the result with f_d to obtain f_cd;
4) feed f_cd into the multi-branch dilated-convolution module ASPP to obtain the feature map f;
step three: based on the feature map f obtained in step two, divide the features in the feature pyramid into two classes according to whether their resolution is higher or lower than that of f, and construct a feature pyramid suitable for multi-scale target detection by applying a different processing method to each class;
step four: generate Anchors automatically (Guided Anchoring), in the following three parts:
1) apply a single-channel 1 × 1 convolution to the classification-branch feature map f_cls, followed by a Sigmoid operation, to obtain the probability that an Anchor should be placed at each position;
2) apply a two-channel 1 × 1 convolution to the regression-branch feature map f_reg to predict the width and height of the Anchor placed at each position;
3) convolve the Anchor width-height feature map produced by the regression branch to compute offsets for the convolution sampling points, then apply deformable convolution to f_cls and f_reg respectively to obtain the features used to classify and regress the Anchors.
Step five: design of loss function
The loss function of the entire network is expressed as

Loss = L_cls + L_reg + λ(L_loc + L_shape)

where L_loc denotes the Anchor-location loss, L_shape the loss of the Anchor-shape branch, L_cls the classification loss of the Anchor prediction part, L_reg the regression loss of the Anchor prediction part, and λ a weighting coefficient.
Further, the specific process of step one is as follows:
1) randomly crop a region from the training image, ensuring that the cropped region contains a target;
2) randomly expand the cropped region with zero pixels;
3) scale the expanded picture to the input resolution.
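The three augmentation steps above can be sketched as follows. This is a minimal NumPy illustration under assumed parameters (crop-size range, 1.5x expansion ratio, nearest-neighbour resize); the function and argument names are illustrative, not from the patent, and the target-coverage check described in the detailed embodiment is omitted here.

```python
import numpy as np

def random_crop_expand(img, rng, out_size=512):
    """Sketch of the three augmentation steps: random crop,
    zero-pixel expansion, and rescaling to the input resolution."""
    h, w = img.shape[:2]
    # 1) random crop (the target-coverage IoU check is omitted here)
    cw, ch = rng.integers(w // 2, w + 1), rng.integers(h // 2, h + 1)
    x0, y0 = rng.integers(0, w - cw + 1), rng.integers(0, h - ch + 1)
    crop = img[y0:y0 + ch, x0:x0 + cw]
    # 2) expand: place the crop at a random offset inside a larger zero canvas
    ew, eh = int(cw * 1.5), int(ch * 1.5)
    canvas = np.zeros((eh, ew) + img.shape[2:], dtype=img.dtype)
    px, py = rng.integers(0, ew - cw + 1), rng.integers(0, eh - ch + 1)
    canvas[py:py + ch, px:px + cw] = crop
    # 3) nearest-neighbour resize to the network input resolution
    ys = np.arange(out_size) * eh // out_size
    xs = np.arange(out_size) * ew // out_size
    return canvas[ys][:, xs]
```

Because both the crop size and the expansion offset are random, the same target appears at many effective scales across epochs, which is the stated goal of the augmentation.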
Further, the number of convolution modules in step 2) in the second step is 3.
Further, the specific process of step three is as follows:
1) for features in the pyramid whose resolution is higher than that of the f acquired in step two, enlarge the resolution of f by nearest-neighbour upsampling and then refine the features with a 1 × 1 convolution;
2) for features whose resolution is less than or equal to that of f, obtain them directly from f with 3 × 3 convolutions of the appropriate stride.
Further, each loss in step five is calculated as follows:
1) in the Anchor generation part, considering that positive and negative samples are extremely imbalanced, the Anchor-location loss L_loc is calculated with Focal Loss;
2) in the Anchor generation part, only width and height are considered, and the Anchor-shape branch loss L_shape is calculated with GIoU Loss;
3) the Anchor prediction comprises classification and regression: the classification loss L_cls uses a Softmax-based cross-entropy loss function, and the regression loss L_reg uses the Smooth L1 loss function.
By adopting the above technical scheme, the invention provides the following beneficial effects:
Considering that road detection equipment is sparsely distributed, the invention, on one hand, constructs a feature pyramid whose levels carry balanced and sufficient feature information, approached from three angles: semantic information, detail information and receptive field; on the other hand, to improve recall on extreme-scale targets, it abandons manually set Anchor size and aspect-ratio parameters and lets the network learn the sizes and distribution the Anchors require. By redesigning the single-stage detector from the perspectives of feature-pyramid construction and Anchor generation, the invention improves detection accuracy on multi-scale targets while preserving detection speed.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
The drawings of the present invention are described below.
FIG. 1 is a schematic flow diagram of a single-stage detector suitable for multi-scale object detection.
Fig. 2 is a schematic diagram of a single-stage detector suitable for multi-scale target detection.
Detailed Description
The invention is further illustrated by the following figures and examples.
Example 1
As shown in fig. 1-2, the method for designing a single-stage detector suitable for multi-scale target detection based on deep learning provided in this embodiment includes the following steps:
Step one: perform data enhancement on the image, in the following three parts:
1) first randomly generate the width and height of the cropping region, then randomly generate its top-left corner, thereby obtaining the cropping region. The intersection-over-union IoU between the cropping region and every target box is calculated as:

IoU = area(A ∩ B) / area(A ∪ B)

where area denotes the area of a box, ∩ the intersection of two boxes and ∪ their union. If the minimum IoU over all target boxes is not less than a specified threshold (e.g. 0.5), the overlap between the cropping region and every target is sufficient. If the condition is not met, cropping is retried up to 50 times; if it is still not met, the original image is output directly;
2) randomly generate the width and height of the expanded picture, randomly generate a top-left corner at which to place the cropped region, fill the cropped region into the expanded picture, and fill the remaining area with zero pixels;
3) randomly select one of the following five scaling modes: nearest-neighbour interpolation, bilinear interpolation, pixel-area relation, bicubic interpolation over a 4 × 4 pixel neighbourhood, and Lanczos interpolation over an 8 × 8 pixel neighbourhood, and scale the expanded picture to the specified input resolution. Expanding and scaling after random cropping generates more multi-scale targets, strengthening the model's multi-scale representation. In the test stage, step one performs only the scaling part.
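The IoU test used to validate a crop can be sketched as below, assuming boxes in (x1, y1, x2, y2) form; the helper names are illustrative, not from the patent.

```python
import numpy as np

def iou_matrix(crop_box, gt_boxes):
    """IoU between one crop region and N target boxes, all (x1, y1, x2, y2)."""
    x1 = np.maximum(crop_box[0], gt_boxes[:, 0])
    y1 = np.maximum(crop_box[1], gt_boxes[:, 1])
    x2 = np.minimum(crop_box[2], gt_boxes[:, 2])
    y2 = np.minimum(crop_box[3], gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_c = (crop_box[2] - crop_box[0]) * (crop_box[3] - crop_box[1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_c + area_g - inter)

def crop_is_valid(crop_box, gt_boxes, thr=0.5):
    # accept the crop only if every target overlaps it by at least `thr`
    return iou_matrix(crop_box, np.asarray(gt_boxes, dtype=float)).min() >= thr
```

In the retry loop described above, `crop_is_valid` would be evaluated up to 50 times before falling back to the original image.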
Step two: acquire the feature f with high semantics, high detail and a large receptive field, in the following four parts:
1) taking ResNet-50 as an example, remove the final global average pooling and fully connected layers and use the remainder as the backbone network; after feature extraction through the backbone, the input picture yields a 32x-downsampled feature map f_c with sufficient semantic information;
2) in parallel with the backbone, design a shallow convolutional neural network to supplement detail information. Specifically, combine a convolution, a batch-normalization layer (BN) and an activation function (ReLU) into one convolution module, stack several (e.g. 3) such modules, downsample the input picture 16x with a pooling layer and feed it into the stacked modules, obtaining a feature map f_d encoding rich detail information;
3) since the spatial resolution of f_c is smaller than that of f_d, f_c must be upsampled 2x before fusion: taking every group of 4 channels, the values at the same position are rearranged into a 2 × 2 block, so the upsampled f_c^up doubles the spatial resolution while reducing the channel count to a quarter, using channel information to supplement spatial resolution; a 1 × 1 convolution then makes the channel counts of f_c^up and f_d equal. Finally a Sigmoid operation is applied to f_c^up and the result is multiplied onto f_d, yielding the feature f_cd with sufficient semantic and detail information. The whole process is:

f_cd = Sigmoid(W · f_c^up) ⊗ f_d

where W denotes the weights of the 1 × 1 convolutional layer;
4) feed f_cd into the multi-branch dilated-convolution module ASPP (Atrous Spatial Pyramid Pooling) to further enlarge the receptive field, obtaining a high-quality feature f with sufficient semantic information, rich detail information and an adequate receptive field;
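The channel-to-space upsampling and the Sigmoid-gated fusion of step two can be sketched in NumPy as follows. This is a shape-level illustration only: the 1 × 1 convolution is modelled as a channel-mixing matrix, tensor layouts are assumed to be (C, H, W), and all names are illustrative.

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Sub-pixel upsample: trade channels for spatial resolution.
    x: (C*r*r, H, W) -> (C, H*r, W*r)."""
    c, h, w = x.shape
    x = x.reshape(c // (r * r), r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)            # (C, H, r, W, r)
    return x.reshape(c // (r * r), h * r, w * r)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(fc, fd, w1x1):
    """f_cd = Sigmoid(W * up(f_c)) * f_d, with W a 1x1 conv aligning channels."""
    up = pixel_shuffle(fc)                     # (C/4, 2H, 2W)
    gate = np.einsum('oc,chw->ohw', w1x1, up)  # 1x1 conv as channel mixing
    return sigmoid(gate) * fd
```

The Sigmoid gate lets the semantic map f_c modulate, position by position, how strongly each detail feature in f_d is expressed.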
Step three: construct the feature pyramid suitable for multi-scale target detection, in the following two parts:
1) to obtain the 8x-downsampled map of the pyramid, upsample f 2x by recombining every 4 adjacent channels into a feature with doubled spatial resolution, and use a 1 × 1 convolution to unify the output channel count to 256;
2) obtain the pyramid level with the same resolution as f by applying a 256-channel 3 × 3 convolution to f; for levels with resolution smaller than f, for example the 64x-downsampled map, f must be further downsampled 4x, which is achieved by cascading two stride-2 3 × 3 convolutions, and so on. The pyramid has 5 levels in total, the largest being the 8x-downsampled map and the smallest the 128x-downsampled map.
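The level layout of step three can be sketched as follows. Purely to keep the sketch self-contained, the stride-2 3 × 3 convolution is stood in for by a stride-2 3 × 3 average pool, and the stride-8 level uses nearest-neighbour upsampling; only the stride/shape bookkeeping is faithful to the text, and the names are illustrative.

```python
import numpy as np

def conv3x3_stride2(x):
    """Stand-in for a stride-2 3x3 conv (here a 3x3 mean with stride 2,
    padding 1), used only to show how each extra level halves resolution."""
    c, h, w = x.shape
    p = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c, (h + 1) // 2, (w + 1) // 2))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            out[:, i, j] = p[:, 2 * i:2 * i + 3, 2 * j:2 * j + 3].mean(axis=(1, 2))
    return out

def build_pyramid(f):
    """f is the stride-16 map; the 5 levels have strides 8, 16, 32, 64, 128."""
    levels = {8: f.repeat(2, axis=1).repeat(2, axis=2),  # nearest-neighbour up
              16: f}
    cur = f
    for s in (32, 64, 128):
        cur = conv3x3_stride2(cur)
        levels[s] = cur
    return levels
```

With a 512 × 512 input, the stride-16 map is 32 × 32, so the five levels run from 64 × 64 down to 4 × 4.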
Step four: generate Anchors automatically, in the following three parts:
1) apply a single-channel 1 × 1 convolution and a Sigmoid operation to the classification-branch feature map f_cls to predict, at each position, the probability that an Anchor should be placed there. Training therefore requires an Anchor-location label map derived from the target-box positions; the basic principle is that Anchors should be placed in the central region of a target box, while pixels far from any target box receive no Anchor. The pyramid level responsible for a target of width w and height h is chosen as

level = ⌊log₂(√(w·h) / 32) + 0.5⌋

where the highest-resolution pyramid feature is 8x-downsampled and each point on a feature map is conventionally assigned a reference Anchor size of 4 times the downsampling factor, i.e. 32 for the first level; the logarithm, the addition of 0.5 and the floor round the assignment to the nearest level. Thus a target with area below 32² is detected by the highest-resolution map; a target with area between 32² and 64² by the second-highest, and so on. Then set the hyper-parameters ε₁ = 0.2 (centre-region ratio) and ε₂ = 0.5 (ignored-region ratio). Representing a target box as (x, y, w, h), with (x, y) the centre point and (w, h) the width and height, the centre region CR, ignored region IR and outer region OR are:

CR = (x, y, ε₁w, ε₁h)
IR = (x, y, ε₂w, ε₂h) \ CR
OR = R \ (x, y, ε₂w, ε₂h)

where R denotes the whole feature-map plane and A \ B means region B is subtracted from A. Inside CR the Anchor-location label is 1, i.e. an Anchor is placed; inside IR Anchor placement is ignored, i.e. no gradient is back-propagated; inside OR the label is 0, i.e. no Anchor is placed. Where the CR of one target overlaps another region, CR takes precedence; where IR and OR overlap, IR takes precedence; in short, CR > IR > OR. In addition, to ease gradient conflicts between pyramid levels, the CR region on an adjacent level's feature map is also treated as IR on the current map. At test time, Anchors are placed only at positions whose location score exceeds a specified threshold (e.g. 0.01);
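The CR / IR / OR label map for a single target can be sketched as follows, for one pyramid level and with the box already in feature-map coordinates. Labels follow a common convention (1 = place Anchor, -1 = ignore, 0 = no Anchor); the function name and the cross-level IR handling omitted here are illustrative simplifications.

```python
import numpy as np

def anchor_location_targets(shape, box, eps1=0.2, eps2=0.5):
    """Label map for the anchor-location branch: 1 inside the centre
    region CR, -1 (ignored) in the IR ring, 0 elsewhere (OR).
    `box` = (x, y, w, h) with (x, y) the centre, in feature-map coords."""
    x, y, w, h = box
    labels = np.zeros(shape, dtype=np.int8)

    def region(e):
        x0 = max(int(x - e * w / 2), 0)
        x1 = int(np.ceil(x + e * w / 2))
        y0 = max(int(y - e * h / 2), 0)
        y1 = int(np.ceil(y + e * h / 2))
        return slice(y0, y1), slice(x0, x1)

    labels[region(eps2)] = -1   # IR: do not back-propagate here
    labels[region(eps1)] = 1    # CR overrides IR (priority CR > IR > OR)
    return labels
```

Writing IR first and CR second reproduces the stated priority CR > IR > OR for a single box; with multiple boxes the same ordering would be applied over all IR regions before all CR regions.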
2) apply a two-channel 1 × 1 convolution to the regression-branch feature map f_reg to compute the two Anchor shape parameters as

w = σ·e^{dw},  h = σ·e^{dh}

where dw and dh denote the generated width and height parameters and σ is a scale variable that can be learned or set manually; here, for simplicity, σ is set to 8s, with s the stride of the feature map. The exponential form keeps the Anchor width and height non-negative. As in the previous part, training requires the target matched to each feature point so that this branch can be optimized with a loss function. For an Anchor variable a_wh = (x, y, w, h) with unknown width and height and a target box gt = (x_g, y_g, w_g, h_g), define

vIoU(a_wh, gt) = max over w, h of IoU(a_wh, gt)

i.e. the maximum IoU attainable between the Anchor and the target box over all shapes. Clearly w and h, as two real numbers, cannot be enumerated exhaustively, so they are sampled (e.g. using the 9 Anchor shapes set in RetinaNet), giving each Anchor its matched target box and IoU. Positive-sample Anchors are then obtained in two ways: first, with a positive-sample threshold of 0.5, an Anchor whose IoU exceeds the threshold is a positive sample; second, considering each target independently, the Anchor with the maximum IoU is taken as a positive sample if that IoU exceeds 0.4. Finally, 128 positive samples are randomly drawn and the loss function optimizes the difference between the Anchor shape and the target-box shape;
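The shape decoding above reduces to a two-line function; this sketch assumes σ = 8s as stated in the text, and the function name is illustrative.

```python
import numpy as np

def decode_anchor_shape(dw, dh, stride):
    """Decode the two regressed shape parameters into anchor width/height.
    Per the text, sigma = 8 * stride; the exponential keeps w, h > 0."""
    sigma = 8.0 * stride
    return sigma * np.exp(dw), sigma * np.exp(dh)
```

At dw = dh = 0 the decoded Anchor is exactly the reference size 8s for that level, so the network only learns a log-scale correction around it.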
3) since Anchor shapes differ across positions on the same feature map, the Anchor-shape map from the previous part is convolved to produce sampling-point offsets, and deformable convolution is applied to the classification and regression features respectively. Taking a 3 × 3 deformable convolution as an example, with sampling grid R = {(-1,-1), (-1,0), …, (1,1)}:

y(p₀) = Σ over p_n in R of W(p_n)·x(p₀ + p_n + Δp_n)

where Δp_n denotes the sampling-point offset generated from the Anchor-shape feature map, W the parameters of the deformable convolution, x the original feature, and y the new feature produced by the deformable convolution.
Step five: calculating the loss function of the whole network, which mainly comprises the following four parts:
1) the loss of the Anchor position part is calculated by using a cross entropy loss function:
Lloc=-(ylgp+(1-y)lg(1-p))
where y and p represent the corresponding values in the Anchor position label map and the prediction map, respectively. According to the description in the fourth step, it can be known that the difference between the numbers of positive and negative samples in the Anchor position label graph is very large, so that the loss generated by the positive and negative samples needs to be weighted to relieve that the gradient direction is dominated by the negative sample due to the imbalance problem. P is defined by the formulat
Figure BDA0002511521580000085
Then Lloc=-lgpt. Further, the number and difficulty of positive and negative samples are balanced by the Focal length, as shown in the following formula:
L_loc = -α_t·(1-p_t)^r·lg p_t
α_t = α,     if y = 1
α_t = 1 - α, otherwise
where α balances the unequal numbers of positive and negative samples, and (1-p_t)^r balances the unequal numbers of easy and hard samples, so that the network concentrates on classes that are few in number and hard to learn;
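The Focal Loss formulas above can be sketched element-wise in NumPy (an illustrative helper; the exponent written r in the text is the usual focusing parameter, called `gamma` here):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Per-element focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    a_t = np.where(y == 1, alpha, 1 - alpha)  # class-balance weight
    return -a_t * (1 - p_t) ** gamma * np.log(p_t)
```

With `alpha=1` and `gamma=0` this degenerates to plain cross entropy, and a confidently-correct ("easy") sample contributes far less loss than a hard one.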
2) Anchor-shape branch loss function: following the analysis in step four, 128 positive-sample Anchors are randomly selected; the predicted width and height are combined with the coordinates of the current feature point and mapped back to the original image, giving the concrete position and shape of each Anchor. This part of the loss is computed with an IoU-based method, as follows:
L_shape = 1 - IoU(B, B^gt) + R(B, B^gt)
where R(B, B^gt) is a penalty term between the generated Anchor box B and the target box B^gt. The invention adopts the DIoU loss to compute this regression loss; its penalty term is:
R_DIoU = ρ²(b, b^gt) / c²
where ρ²(b, b^gt) is the squared Euclidean distance between the centre points of the Anchor box B and the target box B^gt, and c is the diagonal length of the smallest rectangle enclosing both B and B^gt. Optimizing the Anchor-shape branch with DIoU yields Anchors that better match the target distribution;
3) for the classification part of Anchor, a Softmax-based cross-entropy loss function is adopted:
L_cls = -lg( e^{x_j} / (Σ_{i=1}^{C} e^{x_i} + ε) )
where x_j is the predicted score for the sample's true class, C is the total number of classes, and ε is a very small number (e.g. 10^-5) that prevents the quotient from collapsing to 0 when the denominator falls below machine precision. In addition, to avoid positive/negative imbalance, the algorithm sorts all negative samples by loss and back-propagates gradient only through the negatives with the largest loss, at most three times the number of positives. The regression loss of the Anchor prediction part is computed with the Smooth L1 loss function, as shown below:
SmoothL1(x) = 0.5·x²,    if |x| < 1
SmoothL1(x) = |x| - 0.5, otherwise
where x is the difference between the encoded offset of the target box relative to the Anchor and the predicted offset; the encoding is consistent with most detection algorithms (e.g. Faster R-CNN, SSD);
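The Smooth L1 loss and the 3:1 hard-negative selection described above can be sketched as (hypothetical helper names):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise (element-wise)."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def hard_negative_select(neg_losses, n_pos, ratio=3):
    """Indices of the `ratio * n_pos` negatives with the largest loss."""
    k = min(len(neg_losses), ratio * n_pos)
    return np.argsort(neg_losses)[::-1][:k]
```

Smooth L1 is quadratic near zero (stable gradients for small errors) and linear elsewhere (robust to outliers), which is why it is the common choice for box regression.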
4) combining the three parts above, the loss function of the whole network is a weighted sum of the loss of the Anchor-generation part and the loss of the network prediction part, as shown below:
Loss=Lcls+Lreg+λ(Lloc+Lshape)
where λ is the weighting coefficient between the two parts of the loss and is typically set to 1.
Finally, the above embodiments only illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions without departing from their spirit and scope, and all such modifications are covered by the protection scope of the present invention.

Claims (5)

1. A design method of a single-stage detector suitable for multi-scale target detection based on deep learning is characterized by comprising the following steps:
the method comprises the following steps: performing data enhancement on the image;
step two: the method for acquiring the feature f with high semantic, high detail and large receptive field comprises the following four parts:
1) passing the input picture through a backbone network with 32× downsampling to obtain a feature map f_c with sufficient semantic information;
2) in parallel with the backbone network, applying 16× pooled downsampling to the input picture and passing it through a shallow network of several convolution modules to obtain a feature map f_d encoding rich detail information;
3) fusing f_c and f_d: upsampling f_c to obtain f_c′, while using a 1 × 1 convolution to make the feature dimensions of f_d and f_c′ fully consistent; applying a Sigmoid operation to f_c′ and multiplying the result with f_d to obtain f_cd;
4) inputting f_cd into a multi-branch atrous-convolution module (ASPP) to obtain the feature map f;
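Sub-step 3) can be sketched at shape level in NumPy (nearest-neighbour 2× upsampling since f_c is at 1/32 scale and f_d at 1/16; the 1 × 1 convolution is a channel mix via `einsum`; all names and shapes are illustrative, and the ASPP module of sub-step 4) is omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nearest_upsample2x(f):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def fuse(f_c, f_d, w1x1):
    """f_cd = sigmoid(1x1-conv(up(f_c))) * f_d  (channel-matching Sigmoid gate)."""
    up = nearest_upsample2x(f_c)                  # (C_c, 2H, 2W)
    proj = np.einsum('oc,chw->ohw', w1x1, up)     # 1x1 conv = per-pixel channel mix
    return sigmoid(proj) * f_d                    # gate the detail features
```

The Sigmoid turns the semantic map into a per-pixel, per-channel attention gate over the detail features, rather than simply adding the two maps.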
step three: based on the feature map f obtained in step two, dividing the features of the feature pyramid into two types, those with resolution higher than f and those with resolution lower than f, and constructing a feature pyramid suitable for multi-scale target detection by processing the two types differently;
step four: automatic generation of Anchors, i.e. Guided Anchoring, comprising the following three parts:
1) applying a single-channel 1 × 1 convolution followed by a Sigmoid operation to the classification branch feature map f_cls to obtain the probability that an Anchor should be placed at each position;
2) applying a two-channel 1 × 1 convolution to the regression branch feature map f_reg to compute the two parameters, width and height, of the Anchor placed at each position;
3) computing convolution sampling-point offsets from the Anchor width-height feature map produced by the regression branch, and applying deformable convolution to f_cls and f_reg respectively to obtain the features used to classify and regress the Anchor;
step five: design of loss function
The loss function of the entire network is expressed as
Loss=Lcls+Lreg+λ(Lloc+Lshape)
in the formula, L_loc denotes the Anchor position loss, L_shape the Anchor-shape branch loss, L_cls the classification loss of the Anchor prediction part, L_reg the regression loss of the Anchor prediction part, and λ a weighting coefficient.
2. The design method according to claim 1, wherein the specific process of the first step is as follows:
1) randomly cropping an area from the training image while ensuring that the cropped area contains a target;
2) randomly expanding the cropped area with zero pixels;
3) scaling the expanded picture to the input resolution.
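The three augmentation steps of claim 2 can be sketched as a toy single-channel version (the target-preserving check of step 1) is noted but not implemented, and all sizes and names are illustrative):

```python
import numpy as np

def augment(img, rng, out_size=300):
    """Random crop -> zero-pad expand -> resize, per the three steps above."""
    H, W = img.shape[:2]
    # 1) random crop (a real implementation would also keep at least one target)
    y0 = rng.integers(0, H // 4 + 1); x0 = rng.integers(0, W // 4 + 1)
    crop = img[y0: y0 + 3 * H // 4, x0: x0 + 3 * W // 4]
    # 2) expand onto a larger canvas of zero pixels at a random position
    ch, cw = crop.shape[:2]
    canvas = np.zeros((ch * 2, cw * 2) + img.shape[2:], dtype=img.dtype)
    oy = rng.integers(0, ch + 1); ox = rng.integers(0, cw + 1)
    canvas[oy: oy + ch, ox: ox + cw] = crop
    # 3) nearest-neighbour resize to the network input resolution
    ys = (np.arange(out_size) * canvas.shape[0] / out_size).astype(int)
    xs = (np.arange(out_size) * canvas.shape[1] / out_size).astype(int)
    return canvas[ys][:, xs]
```

Cropping enlarges objects while zero-pad expansion shrinks them, so together the two steps expose the detector to a wider range of object scales.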
3. The design method according to claim 1 or 2, wherein the number of convolution modules in step 2) in step two is 3.
4. The design method according to claim 1, wherein the specific process of step three is as follows:
1) for features of the pyramid with resolution higher than the f acquired in step two, enlarging the resolution of f with nearest-neighbour upsampling and then refining the features with a 1 × 1 convolution;
2) for features of the pyramid with resolution less than or equal to f, obtaining them directly from f by 3 × 3 convolution with a specified stride.
5. The design method according to claim 1, wherein the calculation method of each loss in the step five is as follows:
1) in the Anchor generation part, considering the extreme imbalance between positive and negative samples, computing the Anchor position loss L_loc with Focal Loss;
2) in the Anchor generation part, considering only width and height, computing the Anchor-shape branch loss L_shape with GIoU Loss;
3) the Anchor prediction comprises classification and regression; the classification loss L_cls uses a Softmax-based cross-entropy loss function and the regression loss L_reg uses the Smooth L1 loss function.
CN202010462591.XA 2020-05-27 2020-05-27 Single-stage detector design method suitable for multi-scale target detection based on deep learning Active CN111767944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010462591.XA CN111767944B (en) 2020-05-27 2020-05-27 Single-stage detector design method suitable for multi-scale target detection based on deep learning

Publications (2)

Publication Number Publication Date
CN111767944A true CN111767944A (en) 2020-10-13
CN111767944B CN111767944B (en) 2023-08-15

Family

ID=72719742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010462591.XA Active CN111767944B (en) 2020-05-27 2020-05-27 Single-stage detector design method suitable for multi-scale target detection based on deep learning

Country Status (1)

Country Link
CN (1) CN111767944B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417990A (en) * 2020-10-30 2021-02-26 四川天翼网络服务有限公司 Examination student violation behavior identification method and system
CN112529005A (en) * 2020-12-11 2021-03-19 西安电子科技大学 Target detection method based on semantic feature consistency supervision pyramid network
CN112633162A (en) * 2020-12-22 2021-04-09 重庆大学 Rapid pedestrian detection and tracking method suitable for expressway outfield shielding condition
CN112733929A (en) * 2021-01-07 2021-04-30 南京工程学院 Improved method for detecting small target and shielded target of Yolo underwater image
CN113052170A (en) * 2021-03-22 2021-06-29 江苏东大金智信息系统有限公司 Small target license plate recognition method under unconstrained scene
CN113221754A (en) * 2021-05-14 2021-08-06 深圳前海百递网络有限公司 Express waybill image detection method and device, computer equipment and storage medium
CN113780358A (en) * 2021-08-16 2021-12-10 华北电力大学(保定) Real-time hardware fitting detection method based on anchor-free network
CN114189876A (en) * 2021-11-17 2022-03-15 北京航空航天大学 Flow prediction method and device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180165551A1 (en) * 2016-12-08 2018-06-14 Intel Corporation Technologies for improved object detection accuracy with multi-scale representation and training
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN109584227A (en) * 2018-11-27 2019-04-05 山东大学 A kind of quality of welding spot detection method and its realization system based on deep learning algorithm of target detection
CN110807384A (en) * 2019-10-24 2020-02-18 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Small target detection method and system under low visibility
CN111008619A (en) * 2020-01-19 2020-04-14 南京智莲森信息技术有限公司 High-speed rail contact net support number plate detection and identification method based on deep semantic extraction

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BINQI FAN et al.: "FFBNET: Lightweight Backbone for Object Detection Based Feature Fusion Block", 2019 IEEE International Conference on Image Processing (ICIP), pages 3920-3924 *
QIUSHAN GUO et al.: "MSFD: Multi-scale receptive field face detector", 2018 24th International Conference on Pattern Recognition (ICPR), pages 1869-1874 *
QIAO YANJUN: "Geographic Entity Object Recognition and Detection for Outdoor Augmented Reality", China Masters' Theses Full-text Database (Information Science and Technology), no. 08, pages 140-8 *
ZHANG QINGWU et al.: "Pedestrian Detection Method Based on Anchor-free Architecture", Information Technology and Network Security, pages 59-63 *

Also Published As

Publication number Publication date
CN111767944B (en) 2023-08-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant