CN114821665A - Urban pedestrian flow small target detection method based on convolutional neural network - Google Patents

Urban pedestrian flow small target detection method based on convolutional neural network

Info

Publication number
CN114821665A
CN114821665A (application CN202210574388.0A)
Authority
CN
China
Prior art keywords: feature, feature map, network, convolution, fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210574388.0A
Other languages
Chinese (zh)
Inventor
产思贤
俞敏明
赖周年
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202210574388.0A priority Critical patent/CN114821665A/en
Publication of CN114821665A publication Critical patent/CN114821665A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a convolutional-neural-network-based method for detecting small urban pedestrian flow targets. An image training data set annotated with small portrait target detection boxes is first subjected to Mosaic and MixUp data enhancement and resized to the network input size. Each picture is fed into a backbone network to obtain feature maps at four scales, which are then processed by the feature fusion network BIAFPN. The fused feature maps are sent to their corresponding prediction heads, where the classification and regression branches are convolved separately and then concatenated along the channel dimension; the concatenated maps are flattened to one dimension and joined to form the final feature map. The loss is then computed, back propagation is performed and the network parameters are updated to complete network training. By introducing shallow fine-grained features and detecting targets with a feature fusion network, the invention effectively improves the accuracy of small urban portrait target detection.

Description

Urban pedestrian flow small target detection method based on convolutional neural network
Technical Field
The application belongs to the technical field of deep learning image processing, and particularly relates to a method for detecting urban pedestrian flow small targets based on a convolutional neural network.
Background
Target detection is a fundamental problem in machine vision. It supports visual tasks such as instance segmentation, target tracking and action recognition, and is widely applied in autonomous driving, satellite imagery, surveillance and similar fields. Most existing target detection algorithms use anchor boxes, which often causes an imbalance between positive and negative samples; small target detection is even more difficult, and improving small target detection accuracy remains an open problem.
Current mainstream target detection schemes comprise one-stage and two-stage algorithms. Two-stage algorithms such as the Faster R-CNN series first screen a large number of candidate regions that may contain targets and then perform detection on those regions. One-stage algorithms such as the YOLO series complete the prediction end to end; the model detects faster, but detection accuracy drops to a certain extent.
Disclosure of Invention
The application aims to provide a convolutional-neural-network-based method for detecting small urban pedestrian flow targets. A shallow information layer is added to the multi-scale features of the original YOLOX scheme, and an improved feature fusion structure, BIAFPN, is adopted, so as to solve the problem of low detection accuracy for small urban portrait targets.
In order to achieve the purpose, the technical scheme of the application is as follows:
a method for detecting urban pedestrian flow small targets based on a convolutional neural network comprises the following steps:
acquiring an image training data set with a portrait small target detection frame, and performing Mosaic data enhancement and MixUp data enhancement on the image training data set;
adjusting the enhanced image training data set to the input image size, inputting it into the backbone network CSPDarknet-53, and acquiring feature maps F1, F2, F3 and F4 of four sizes output by the dark2, dark3, dark4 and dark5 units of the backbone network CSPDarknet-53;
inputting the feature maps F1, F2, F3 and F4 of four sizes into the feature fusion network BIAFPN for feature processing to obtain fused feature maps F12, F22, F32 and F42;
sending the fused feature maps F12, F22, F32 and F42 into their corresponding prediction heads, performing the convolutions of the classification branch and the regression branch separately, concatenating the results along the channel dimension, flattening each concatenated feature map to one dimension to obtain flattened feature maps F13, F23, F33 and F43, then joining F13, F23, F33 and F43 to obtain the final feature map; calculating the loss, performing back propagation to update the network parameters, and completing the training of the network;
and inputting the image to be detected into the trained network to obtain a detection result.
Further, the Mosaic data enhancement includes:
taking out 4 images, and splicing the images in a random scaling, random cutting and random arrangement mode;
the MixUp data enhancement comprises: superimposing 2 images together.
Further, inputting the feature maps F1, F2, F3 and F4 of four sizes into the feature fusion network BIAFPN for feature processing to obtain the fused feature maps F12, F22, F32 and F42 comprises:
inputting the feature map F1 directly into the feature fusion network BIAFPN; first, along the top-down path, F1 is passed through a 1×1 convolution, upsampled and adaptively fused with the feature map F2 to obtain a feature map F21; F21 is then passed through a 1×1 convolution, upsampled and adaptively fused with the feature map F3 to obtain a feature map F31; F31 is passed through a 1×1 convolution, upsampled and adaptively fused with the feature map F4 to obtain a feature map F41; the feature map F41 is output directly as the feature map F42; the bottom-up and cross-scale fusion is then performed: F42 is passed through a 1×1 convolution and downsampling and fused with the earlier F3 and F31 to obtain a feature map F32; F32 is passed through a 1×1 convolution and downsampling and fused with the earlier F2 and F21 to obtain a feature map F22; F22 is passed through a 1×1 convolution and downsampling and fused with the earlier F1 and F11 to obtain a feature map F12; after each feature fusion, a CBAM attention mechanism is applied to enhance the spatial and channel information.
Further, calculating the loss comprises: a classification loss, a bounding-box loss and a target score loss; the classification loss and the target score loss use BCELoss, and the bounding-box loss uses IOULoss.
In the convolutional-neural-network-based urban pedestrian flow small target detection method of this application, a shallow feature layer with finer-grained characteristics is introduced into the existing YOLOX scheme, and the original PANet is replaced by the improved BIAFPN for target detection, so that the accuracy of urban portrait small target detection can be effectively improved.
Drawings
Fig. 1 is a flow chart of the urban pedestrian flow small target detection method based on the neural network.
Fig. 2 is a diagram of a neural network-based urban pedestrian flow small target detection network model.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The neural-network-based urban pedestrian flow small target detection method mainly comprises the following steps: the images are first subjected to data enhancement, and training then proceeds in batches. In each batch, the images are passed through the convolutional neural network to obtain feature maps F1, F2, F3 and F4; the feature maps are fused through BIAFPN to obtain F12, F22, F32 and F42, which are fed into the prediction heads for classification and regression to obtain predicted values. The predicted values are compared with the ground-truth values of the images to calculate the loss; at the end of each batch, back propagation is performed to reduce the loss and the network parameters are updated, completing the training of the network.
In one embodiment, as shown in fig. 1, a method for detecting small urban pedestrian flow targets based on a neural network is provided, including:
and step S1, acquiring an image training data set with a portrait small target detection box, and performing Mosaic data enhancement and MixUp data enhancement on the image training data set.
In this embodiment, the training data set is subjected to Mosaic data enhancement and MixUp data enhancement. Mosaic enhancement takes out 4 images and splices them together with random scaling, random cropping and random arrangement. MixUp enhancement superimposes 2 images together, which can reduce memorization of wrong labels and thereby enhance robustness.
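As an illustrative sketch only, the two enhancements could look roughly like this in Python (the helper names mosaic4/mixup2, the scale and split ranges, and the use of OpenCV for resizing are assumptions; the label boxes would need the same geometric adjustment):

```python
import random
import numpy as np
import cv2  # OpenCV, used here for resizing

def mosaic4(images, out_size=640):
    """Splice 4 images into one out_size x out_size canvas with random
    scaling, cropping and placement. Box adjustment is omitted for brevity."""
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    cx = random.randint(out_size // 4, 3 * out_size // 4)   # random split point x
    cy = random.randint(out_size // 4, 3 * out_size // 4)   # random split point y
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        scale = random.uniform(0.5, 1.5)                     # random scaling
        img = cv2.resize(img, None, fx=scale, fy=scale)
        h, w = y2 - y1, x2 - x1
        ih, iw = img.shape[:2]
        top = random.randint(0, max(ih - h, 0))              # random crop offset
        left = random.randint(0, max(iw - w, 0))
        crop = img[top:top + h, left:left + w]
        canvas[y1:y1 + crop.shape[0], x1:x1 + crop.shape[1]] = crop
    return canvas

def mixup2(img_a, img_b, alpha=1.0):
    """Superimpose 2 same-size images with a Beta-sampled mixing ratio."""
    lam = np.random.beta(alpha, alpha)
    mixed = lam * img_a.astype(np.float32) + (1.0 - lam) * img_b.astype(np.float32)
    return mixed.astype(np.uint8)
```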
Step S2, adjusting the enhanced image training data set to the input image size, inputting it into the backbone network CSPDarknet-53, and acquiring the feature maps F1, F2, F3 and F4 of four sizes output by the dark2, dark3, dark4 and dark5 units of the backbone network CSPDarknet-53.
As shown in fig. 2, the present application uses CSPDarknet-53 as the backbone network for feature extraction. The CSPDarknet-53 is initialized with pre-training weights trained on COCO; batch training is adopted with a batch size of 16 (i.e., each batch processes 16 pictures); the learning rate starts from 0.0025, no learning-rate warm-up is used, and the learning rate is updated with a cosine annealing schedule.
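A minimal PyTorch sketch of this schedule, assuming an SGD optimizer and a total iteration count that are not specified in the text:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_training_schedule(model, epochs=300, iters_per_epoch=1000):
    # Batch size 16 is handled by the data loader; the initial learning rate is 0.0025.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.0025,
                                momentum=0.9, weight_decay=5e-4)
    # Cosine annealing over all iterations; no learning-rate warm-up phase.
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs * iters_per_epoch)
    return optimizer, scheduler
```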
Because the original picture is large, it is scaled proportionally along the long side to 640×640, and the part where the short side is less than 640 is padded with 0. The scaled picture is input into the backbone network CSPDarknet-53, and after a series of operations such as convolution, feature maps F1, F2, F3 and F4 with the four sizes 20×20, 40×40, 80×80 and 160×160 are output in sequence.
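A hedged sketch of this long-side scaling with zero padding (a generic letterbox resize; the helper name and OpenCV usage are assumptions, not the exact preprocessing of this application):

```python
import numpy as np
import cv2

def letterbox(image, size=640):
    """Scale the long side to `size`, keep the aspect ratio, pad the short side with 0."""
    h, w = image.shape[:2]
    r = size / max(h, w)                                    # scale factor from the long side
    resized = cv2.resize(image, (int(w * r), int(h * r)))
    canvas = np.zeros((size, size, 3), dtype=image.dtype)
    canvas[:resized.shape[0], :resized.shape[1]] = resized  # remaining area stays zero
    return canvas, r
```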
The feature map sizes are determined by the backbone network CSPDarknet-53 and are not detailed here. It should be noted that in YOLOX only the features output by dark3, dark4 and dark5 are normally used for multi-scale fusion. In this embodiment, the features output by dark2, dark3, dark4 and dark5 are used for the multi-scale fusion operation and feature maps of four sizes are output, so that shallow fine-grained information can be fused, which benefits small target detection; correspondingly, one additional prediction head is added, achieving a better detection effect.
Step S3, inputting the feature maps F1, F2, F3 and F4 of four sizes into the feature fusion network BIAFPN for feature processing to obtain the fused feature maps F12, F22, F32 and F42.
The feature map F1 (the feature map output by dark5 in fig. 2) is input directly into the feature fusion network BIAFPN. First, along the top-down path, F1 is passed through a 1×1 convolution, upsampled and adaptively fused with the feature map F2 (the feature map output by dark4 in fig. 2, and so on) to obtain the feature map F21. F21 is then passed through a 1×1 convolution, upsampled and adaptively fused with the feature map F3 to obtain the feature map F31. F31 is passed through a 1×1 convolution, upsampled and adaptively fused with the feature map F4 to obtain the feature map F41, and F41 is output directly as the feature map F42. The bottom-up and cross-scale fusion is then performed: F42 is passed through a 1×1 convolution and downsampling and fused with the earlier F3 and F31 to obtain the feature map F32; F32 is passed through a 1×1 convolution and downsampling and fused with the earlier F2 and F21 to obtain the feature map F22; F22 is passed through a 1×1 convolution and downsampling and fused with the earlier F1 and F11 to obtain the feature map F12. After each feature fusion, a CBAM attention mechanism is applied to enhance the spatial and channel information.
Specifically, the 20×20 feature map F1 is input directly into the top-down feature pyramid network BIAFPN, convolved by 1×1 to the same channel number, upsampled to 40×40 and adaptively fused (SUM) with the feature map F2, after which the spatial and channel information is enhanced by the CBAM attention mechanism to obtain F21 (40×40). F21 is convolved by 1×1, upsampled to 80×80 and adaptively fused (SUM) with the feature map F3, then passed through the CBAM attention mechanism to obtain F31 (80×80). F31 is convolved by 1×1, upsampled to 160×160 and adaptively fused (SUM) with the feature map F4, then passed through the CBAM attention mechanism to obtain F41 (160×160); F41 is output directly as the feature map F42 (160×160). Next, the bottom-up and cross-scale fusion is performed: F41 is converted to the matching channel number by a 1×1 convolution, downsampled to 80×80 and adaptively fused (SUM) with the feature maps F3 and F31 to obtain the feature map F32 (80×80). F32 is converted to the matching channel number by a 1×1 convolution, downsampled to 40×40 and adaptively fused (SUM) with the feature maps F2 and F21 to obtain the feature map F22 (40×40). F22 is converted to the matching channel number by a 1×1 convolution, downsampled to 20×20 and adaptively fused (SUM) with the feature maps F1 and F11 to obtain the feature map F12 (20×20). At this point, output feature maps of all four sizes are obtained: F12 (20×20), F22 (40×40), F32 (80×80) and F42 (160×160).
It should be noted that in fig. 2 a DWSConv block, i.e. a depthwise (DW) convolution followed by a pointwise (PW) convolution, is also placed between the adaptive feature fusion SUM and the CBAM attention mechanism; it is not described again here. In this embodiment BIAFPN replaces the original PANet: BIAFPN feature fusion builds on the original bidirectional fusion by adding cross-scale fusion and a CBAM attention mechanism (enhancing the features along the spatial and channel dimensions) to fuse features more effectively.
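The fusion order above can be sketched in simplified PyTorch form as follows. The common channel width, the softmax-weighted sum used for "adaptive feature fusion", the minimal CBAM, and pooling-based downsampling are assumptions for illustration; the DWSConv blocks of fig. 2 are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAM(nn.Module):
    """Minimal CBAM: channel attention followed by spatial attention."""
    def __init__(self, ch, r=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(ch, ch // r, 1), nn.ReLU(),
                                 nn.Conv2d(ch // r, ch, 1))
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        ca = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(x, 1)) +
                           self.mlp(F.adaptive_max_pool2d(x, 1)))
        x = x * ca                                            # channel attention
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(1, keepdim=True), x.max(1, keepdim=True).values], dim=1)))
        return x * sa                                         # spatial attention

class AdaptiveSum(nn.Module):
    """Adaptive feature fusion SUM: softmax-normalised learnable weights."""
    def __init__(self, n):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n))

    def forward(self, feats):
        w = torch.softmax(self.w, dim=0)
        return sum(wi * fi for wi, fi in zip(w, feats))

class BiAFPNLite(nn.Module):
    """Top-down then bottom-up fusion over F1..F4 (deepest 20x20 to shallowest 160x160)."""
    def __init__(self, in_chs, ch=128):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, ch, 1) for c in in_chs)     # 1x1 projections
        self.fuse_td = nn.ModuleList(AdaptiveSum(2) for _ in range(3))
        self.fuse_bu = nn.ModuleList([AdaptiveSum(3), AdaptiveSum(3), AdaptiveSum(2)])
        self.cbam = nn.ModuleList(CBAM(ch) for _ in range(6))              # after each fusion

    def forward(self, f1, f2, f3, f4):
        f1, f2, f3, f4 = (p(f) for p, f in zip(self.proj, (f1, f2, f3, f4)))
        up = lambda x, ref: F.interpolate(x, size=ref.shape[-2:])
        down = lambda x, ref: F.adaptive_max_pool2d(x, ref.shape[-2:])
        # top-down path: F1 -> F21 -> F31 -> F41
        f21 = self.cbam[0](self.fuse_td[0]([up(f1, f2), f2]))
        f31 = self.cbam[1](self.fuse_td[1]([up(f21, f3), f3]))
        f41 = self.cbam[2](self.fuse_td[2]([up(f31, f4), f4]))
        f42 = f41                                   # shallowest level passed through as F42
        # bottom-up path with cross-scale connections to the earlier maps
        f32 = self.cbam[3](self.fuse_bu[0]([down(f42, f3), f3, f31]))
        f22 = self.cbam[4](self.fuse_bu[1]([down(f32, f2), f2, f21]))
        f12 = self.cbam[5](self.fuse_bu[2]([down(f22, f1), f1]))  # text also lists "F11" here
        return f12, f22, f32, f42
```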
Step S4, sending the fused feature maps F12, F22, F32 and F42 into their corresponding prediction heads, performing the convolutions of the classification branch and the regression branch separately, concatenating the results along the channel dimension, flattening each concatenated feature map to one dimension to obtain the flattened feature maps F13, F23, F33 and F43, then joining F13, F23, F33 and F43 to obtain the final feature map; calculating the loss, performing back propagation to update the network parameters, and completing the training of the network.
In this embodiment, in each prediction head the outputs of the classification-branch and regression-branch convolutions are concatenated along the channel dimension, generating four new feature maps of size {W×H×[(cls+reg+obj)]×N}, where W×H is the feature map size, cls is the detection class, reg is the predicted bounding box, obj is the target score prediction, and N is the number of prediction anchor boxes. W and H are multiplied, i.e. the spatial dimensions are flattened to one dimension, giving the feature maps F13, F23, F33 and F43. F13, F23, F33 and F43 are then concatenated along the W×H dimension to obtain the final feature map F.
Finally, the classification loss, bounding-box loss and target score loss are calculated, back propagation is performed to reduce the loss, and the network parameters are updated.
Specifically, after the convolutions of the classification branch and the regression branch, each of F12, F22, F32 and F42 generates 3 new feature maps F_cls ∈ {N×W×H×cls}, F_obj ∈ {N×W×H×1} and F_reg ∈ {N×W×H×4}, which are concatenated along the channel dimension to generate four new tensors of size {N×W×H×[(cls+reg+obj)]}, with W, H ∈ {20, 40, 80, 160}. W and H are then multiplied, flattening the spatial dimensions to one dimension and yielding four tensors of size {N×(cls+reg+obj)×(W×H)}. F13, F23, F33 and F43 are then concatenated along the W×H dimension to obtain the final feature map F ∈ {N×(cls+reg+obj)×34000}.
Here cls is the set of categories in the dataset, the reg prediction bounding box consists of the predicted top-left corner point (x1, y1) and bottom-right corner point (x2, y2), and N is the preset number of anchor boxes, which is 1 in this embodiment.
In this embodiment the prediction head adopts a decoupled-head design, with the classification branch and the regression branch convolved separately, which achieves a better detection effect. The number of predictions per position is reduced from 3 to 1 through the concatenation operation, and the anchor-free design avoids the problem of positive/negative sample imbalance.
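A minimal sketch of this tensor bookkeeping (the single-convolution branches and helper names are simplifying assumptions; in practice each branch contains several convolution layers):

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Per-level decoupled head: separate classification and regression branches."""
    def __init__(self, ch, num_classes):
        super().__init__()
        self.cls_branch = nn.Conv2d(ch, num_classes, 1)  # cls
        self.reg_branch = nn.Conv2d(ch, 4, 1)            # reg: x1, y1, x2, y2
        self.obj_branch = nn.Conv2d(ch, 1, 1)            # obj: target score

    def forward(self, x):
        out = torch.cat([self.cls_branch(x), self.reg_branch(x),
                         self.obj_branch(x)], dim=1)      # concat along the channel dim
        # N x (cls+reg+obj) x H x W -> flatten the spatial dimensions to one dimension
        return out.flatten(start_dim=2)

def gather_predictions(heads, feats):
    """Join the flattened maps of all four levels along the spatial axis.
    For 20x20, 40x40, 80x80 and 160x160 inputs: 400 + 1600 + 6400 + 25600 = 34000 positions."""
    return torch.cat([h(f) for h, f in zip(heads, feats)], dim=2)
```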
Because the output feature values cannot be used directly for the loss calculation, regression must first be performed to obtain actual predicted values. The classification loss, bounding-box loss and target score loss are then computed on the feature map F according to the following formulas; the classification loss and the target score loss use BCELoss, and the bounding-box loss uses IOULoss:

BCELoss = -(y·log(p(x)) + (1 - y)·log(1 - p(x)))

IOULoss = 1 - IOU
It should be noted that the grid in this application is defined on the finally obtained feature maps; it is an abstract concept intended to facilitate the box regression calculation. For the 20×20, 40×40, 80×80 and 160×160 feature maps there are 20×20, 40×40, 80×80 and 160×160 grid cells, respectively. Dividing a feature map into multiple grid cells is a relatively mature technique in the art and is not described here again. Likewise, the CBAM attention mechanism is a relatively mature technique in the field and is not described here again.
1. The classification loss and the target score loss are calculated using the binary cross-entropy loss function (BCELoss):

BCELoss = -(y·log(p(x)) + (1 - y)·log(1 - p(x)))

where y indicates whether the sample is a target, taking the value 1 or 0, and p(x) is the predicted target score.
2. The bounding-box loss is calculated by taking the predicted box information and the real box information obtained from the labels and computing the IOU (Intersection over Union), i.e. the overlap ratio between the predicted box and the real box; predicted boxes with high IOU values are retained by NMS post-processing:

IOULoss = 1 - IOU

wherein

IOU = |A ∩ B| / |A ∪ B|

A is the real box (ground truth), B is the prediction box, |A ∩ B| is the area where the real box and the prediction box intersect, and |A ∪ B| is the area of the union of the real box and the predicted box. The lower the IOULoss value, the more accurate the prediction.
It should be noted that the calculation of the classification loss, the target score loss and the bounding-box loss is a relatively mature technique in the art and is not described here again.
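A hedged sketch of the two loss terms (the mean reduction and the 1 - IOU form of IOULoss are assumptions):

```python
import torch
import torch.nn.functional as F

def bce_loss(p, y):
    """BCELoss = -(y·log(p) + (1-y)·log(1-p)); p are predicted probabilities in (0, 1)."""
    return F.binary_cross_entropy(p, y)

def iou_loss(pred_boxes, gt_boxes, eps=1e-7):
    """pred_boxes, gt_boxes: (M, 4) tensors of (x1, y1, x2, y2). Returns mean 1 - IOU."""
    x1 = torch.max(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = torch.max(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = torch.min(pred_boxes[:, 2], gt_boxes[:, 2])
    y2 = torch.min(pred_boxes[:, 3], gt_boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)               # |A ∩ B|
    area_p = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    union = area_p + area_g - inter + eps                                  # |A ∪ B|
    return (1.0 - inter / union).mean()
```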
The loss between the predicted values and the ground-truth values is thus obtained; before the end of each batch, back propagation is performed to reduce the loss, the network parameters are updated, and training of the next batch begins, until all batches of training data have been processed. The trained weights are finally obtained, and all updated parameters are stored in an output weight file.
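One batch of this loop can be sketched as follows; `loss_fn` stands in for a combined loss routine (cls/obj BCE plus box IOU loss) that is assumed here, not defined in the text:

```python
def train_one_batch(model, images, targets, loss_fn, optimizer, scheduler):
    """One training batch: forward pass, loss, back propagation, parameter update."""
    preds = model(images)
    loss = loss_fn(preds, targets)   # hypothetical combined loss (BCE + IOU terms)
    optimizer.zero_grad()
    loss.backward()                  # back propagation to reduce the loss
    optimizer.step()                 # update the network parameters
    scheduler.step()                 # cosine-annealing learning-rate update
    return loss.item()
```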
And step S5, inputting the image to be detected into the trained network to obtain a detection result.
The image to be detected is likewise scaled to 640×640 and input into the network; feature maps of four sizes are produced by the CSPDarknet-53 backbone network, and after regression on the feature values a prediction comprising the class cls, bounding box reg and target score obj is obtained, giving the final detection result.
The application also adopts the SimOTA positive/negative sample assignment strategy. First, the prediction boxes are screened: only prediction boxes whose center points lie inside the ground-truth box and inside a square of side length 5 around its center are kept. After this preliminary screening, the bounding-box loss between each prediction box and the ground-truth box and the classification loss (binary cross entropy) are calculated, and a cost matrix is computed:

cost = L_cls + λ·L_reg

where L_cls is the classification loss, L_reg is the bounding-box loss and λ is a balancing weight; the cost matrix represents the cost relationship between each ground-truth box and each feature point. For each ground truth, a fixed number k of prediction boxes with the smallest cost are taken as positive samples and the rest as negative samples, avoiding additional hyper-parameters.
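A simplified sketch of this assignment (the λ weight, the fixed k, and masking screened-out predictions with a large constant are assumptions):

```python
import torch

def simota_assign(cls_cost, iou_cost, center_mask, k=10, lam=3.0):
    """cls_cost, iou_cost: (num_gt, num_pred) loss matrices; center_mask: (num_gt, num_pred)
    bool mask from the center-prior screening. Returns a bool matrix of positive matches."""
    cost = cls_cost + lam * iou_cost + 1e5 * (~center_mask)     # exclude screened-out boxes
    pos = torch.zeros_like(center_mask)
    for g in range(cost.shape[0]):
        kk = min(k, cost.shape[1])
        idx = torch.topk(cost[g], kk, largest=False).indices    # k lowest-cost predictions
        pos[g, idx] = True                                      # positives for this ground truth
    return pos
```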
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (4)

1. A method for detecting urban pedestrian flow small targets based on a convolutional neural network is characterized by comprising the following steps:
acquiring an image training data set with a portrait small target detection frame, and performing Mosaic data enhancement and MixUp data enhancement on the image training data set;
adjusting the enhanced image training data set to the input image size, inputting it into the backbone network CSPDarknet-53, and acquiring feature maps F1, F2, F3 and F4 of four sizes output by the dark2, dark3, dark4 and dark5 units of the backbone network CSPDarknet-53;
inputting the feature maps F1, F2, F3 and F4 of four sizes into the feature fusion network BIAFPN for feature processing to obtain fused feature maps F12, F22, F32 and F42;
sending the fused feature maps F12, F22, F32 and F42 into their corresponding prediction heads, performing the convolutions of the classification branch and the regression branch separately, concatenating the results along the channel dimension, flattening each concatenated feature map to one dimension to obtain flattened feature maps F13, F23, F33 and F43, then joining F13, F23, F33 and F43 to obtain the final feature map; calculating the loss, performing back propagation to update the network parameters, and completing the training of the network;
and inputting the image to be detected into the trained network to obtain a detection result.
2. The convolutional neural network-based urban pedestrian flow small target detection method according to claim 1, wherein the Mosaic data enhancement comprises:
taking out 4 images, and splicing the images in a random scaling, random cutting and random arrangement mode;
the MixUp data enhancement comprises: superimposing 2 images together.
3. The convolutional neural network-based urban pedestrian flow small target detection method as claimed in claim 1, wherein inputting the feature maps F1, F2, F3 and F4 of four sizes into the feature fusion network BIAFPN for feature processing to obtain the fused feature maps F12, F22, F32 and F42 comprises:
inputting the feature map F1 directly into the feature fusion network BIAFPN; first, along the top-down path, F1 is passed through a 1×1 convolution, upsampled and adaptively fused with the feature map F2 to obtain a feature map F21; F21 is then passed through a 1×1 convolution, upsampled and adaptively fused with the feature map F3 to obtain a feature map F31; F31 is passed through a 1×1 convolution, upsampled and adaptively fused with the feature map F4 to obtain a feature map F41; the feature map F41 is output directly as the feature map F42; the bottom-up and cross-scale fusion is then performed: F42 is passed through a 1×1 convolution and downsampling and fused with the earlier F3 and F31 to obtain a feature map F32; F32 is passed through a 1×1 convolution and downsampling and fused with the earlier F2 and F21 to obtain a feature map F22; F22 is passed through a 1×1 convolution and downsampling and fused with the earlier F1 and F11 to obtain a feature map F12; after each feature fusion, a CBAM attention mechanism is applied to enhance the spatial and channel information.
4. The convolutional neural network-based urban pedestrian flow small target detection method according to claim 1, wherein calculating the loss comprises: a classification loss, a bounding-box loss and a target score loss; the classification loss and the target score loss use BCELoss, and the bounding-box loss uses IOULoss.
CN202210574388.0A 2022-05-24 2022-05-24 Urban pedestrian flow small target detection method based on convolutional neural network Pending CN114821665A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210574388.0A CN114821665A (en) 2022-05-24 2022-05-24 Urban pedestrian flow small target detection method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210574388.0A CN114821665A (en) 2022-05-24 2022-05-24 Urban pedestrian flow small target detection method based on convolutional neural network

Publications (1)

Publication Number Publication Date
CN114821665A true CN114821665A (en) 2022-07-29

Family

ID=82517232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210574388.0A Pending CN114821665A (en) 2022-05-24 2022-05-24 Urban pedestrian flow small target detection method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN114821665A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035354A (en) * 2022-08-12 2022-09-09 江西省水利科学院 Reservoir water surface floater target detection method based on improved YOLOX
CN115035354B (en) * 2022-08-12 2022-11-08 江西省水利科学院 Reservoir water surface floater target detection method based on improved YOLOX
CN115063795A (en) * 2022-08-17 2022-09-16 西南民族大学 Urinary sediment classification detection method and device, electronic equipment and storage medium
CN115063795B (en) * 2022-08-17 2023-01-24 西南民族大学 Urinary sediment classification detection method and device, electronic equipment and storage medium
CN115546187A (en) * 2022-10-28 2022-12-30 北京市农林科学院 Agricultural pest and disease detection method and device based on YOLO v5
CN115578631A (en) * 2022-11-15 2023-01-06 山东省人工智能研究院 Image tampering detection method based on multi-scale interaction and cross-feature contrast learning
CN115578631B (en) * 2022-11-15 2023-08-18 山东省人工智能研究院 Image tampering detection method based on multi-scale interaction and cross-feature contrast learning
CN115862833A (en) * 2023-02-16 2023-03-28 成都与睿创新科技有限公司 Detection system and method for instrument loss

Similar Documents

Publication Publication Date Title
CN109859190B (en) Target area detection method based on deep learning
CN114821665A (en) Urban pedestrian flow small target detection method based on convolutional neural network
CN111612008B (en) Image segmentation method based on convolution network
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
WO2021218786A1 (en) Data processing system, object detection method and apparatus thereof
CN111882620B (en) Road drivable area segmentation method based on multi-scale information
CN115861772A (en) Multi-scale single-stage target detection method based on RetinaNet
CN113076871A (en) Fish shoal automatic detection method based on target shielding compensation
CN110309765B (en) High-efficiency detection method for video moving target
CN115035361A (en) Target detection method and system based on attention mechanism and feature cross fusion
CN111553414A (en) In-vehicle lost object detection method based on improved Faster R-CNN
CN113313706A (en) Power equipment defect image detection method based on detection reference point offset analysis
CN110751005B (en) Pedestrian detection method integrating depth perception features and kernel extreme learning machine
CN113076972A (en) Two-stage Logo image detection method and system based on deep learning
CN112446292A (en) 2D image salient target detection method and system
CN114332921A (en) Pedestrian detection method based on improved clustering algorithm for Faster R-CNN network
CN111612802A (en) Re-optimization training method based on existing image semantic segmentation model and application
CN113269119B (en) Night vehicle detection method and device
CN117975218A (en) Small target detection method based on mixed attention and feature centralized multi-scale fusion
CN113920479A (en) Target detection network construction method, target detection device and electronic equipment
CN111582057B (en) Face verification method based on local receptive field
CN113514053B (en) Method and device for generating sample image pair and method for updating high-precision map
CN116912670A (en) Deep sea fish identification method based on improved YOLO model
Li YOLOV5-based traffic sign detection algorithm
CN112131996B (en) Road side image multi-scale pedestrian rapid detection method based on channel separation convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination