CN114067186A - Pedestrian detection method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN114067186A (application CN202111129908.9A)
- Authority
- CN
- China
- Prior art keywords
- pedestrian detection
- feature map
- candidate
- picture
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides a pedestrian detection method, a pedestrian detection device, electronic equipment and a storage medium. The method comprises the following steps: acquiring a target picture to be recognized; inputting the target picture to be recognized into a pedestrian detection model to obtain a pedestrian detection result, wherein the pedestrian detection result is a picture containing a plurality of pedestrian detection frames. The pedestrian detection model is composed of a convolutional neural network, an attention mechanism network and a region candidate network. The pedestrian detection method provided by the invention enhances the detection of small targets in the picture and thereby improves the accuracy of pedestrian detection.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a pedestrian detection method, a pedestrian detection device, electronic equipment and a storage medium.
Background
Pedestrian detection is an important research branch in the field of computer vision, and the main task is to judge whether a pedestrian appears in an input image or video sequence and determine the position of the pedestrian. The pedestrian detection technology is widely applied to a plurality of fields such as video monitoring, vehicle auxiliary driving, intelligent robots and the like.
In recent years, deep learning methods have achieved significant breakthroughs in target detection and exhibit stronger detection capability than conventional methods. The existing target detection network Faster R-CNN extracts deep semantic information from the picture through its backbone network, which benefits object classification and the recognition of large targets; however, the shallow layers of the network are not well utilized, which hinders the recognition of small targets and leads to lower accuracy when the network is applied to pedestrian detection.
Disclosure of Invention
In order to solve the above problems, the present invention provides a pedestrian detection method, a pedestrian detection apparatus, an electronic device, and a storage medium, so as to overcome the drawback that the prior art is poorly suited to detecting small targets.
The invention provides a pedestrian detection method, which comprises the following steps:
acquiring a target picture to be identified;
inputting the target picture to be recognized into a pedestrian detection model to obtain a pedestrian detection result; the pedestrian detection result is a picture containing a plurality of pedestrian detection frames;
the pedestrian detection model is composed of a convolutional neural network, an attention mechanism network and a region candidate network, and is trained based on the following steps:
step 1, obtaining a sample picture, inputting the sample picture into a convolutional neural network for multi-scale feature fusion, and obtaining a multi-channel feature map fusing feature information of each layer of picture of the convolutional neural network; the sample picture comprises a pre-marked real pedestrian frame;
step 2, inputting the multi-channel feature map into an attention mechanism network, and respectively carrying out global average pooling and global maximum pooling on the multi-channel feature map to obtain a feature map aggregating important channel feature information in the multi-channel feature map;
step 3, inputting the feature map processed in step 2 into the region candidate network to obtain a picture containing a plurality of pedestrian candidate frames; calculating the Focal-EIoU loss according to the positional relationship between the candidate frames and the real frame, and calculating the Smooth L1 loss according to the coordinates of the candidate frames and the real frame, so as to take the sum of the Focal-EIoU loss and the Smooth L1 loss as the target loss; optimizing the pedestrian detection model according to the target loss, and finishing training when the target loss is less than a preset threshold value, thereby obtaining the trained pedestrian detection model.
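The stopping rule in step 3 can be sketched as follows; the model, the toy optimiser and the way the target loss splits into its two terms are placeholders for illustration, not the patent's implementation:

```python
def train_until_threshold(model, samples, loss_threshold, optimizer_step):
    """Optimise until the target loss (Focal-EIoU loss + Smooth L1 loss)
    drops below the preset threshold, as in step 3. Returns the number
    of optimisation steps taken."""
    steps = 0
    while True:
        focal_eiou, smooth_l1 = model.forward_losses(samples)
        target_loss = focal_eiou + smooth_l1
        if target_loss < loss_threshold:
            return steps
        optimizer_step(model, target_loss)
        steps += 1

class DummyModel:
    """Stand-in for the pedestrian detection model (hypothetical)."""
    def __init__(self):
        self.loss = 2.0
    def forward_losses(self, samples):
        # illustrative split into a Focal-EIoU part and a Smooth L1 part
        return 0.6 * self.loss, 0.4 * self.loss

def halve_loss(model, target_loss):
    model.loss *= 0.5  # toy "optimiser": halve the loss each step

n_steps = train_until_threshold(DummyModel(), [], 0.1, halve_loss)
print(n_steps)  # 5
```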
Optionally, inputting the sample picture into a convolutional neural network for multi-scale feature fusion to obtain a multi-channel feature map fusing the picture feature information of each layer of the convolutional neural network includes:
inputting the sample picture into the convolutional neural network for feature extraction; before each convolution operation of the convolutional neural network, stacking and feature-fusing the picture feature information in the feature map output by the current layer with the picture feature information in the feature map output by the previous layer, and taking the result as the fused feature map of the current layer, so as to finally obtain a multi-channel feature map fusing the picture feature information of each layer of the convolutional neural network.
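The stacking operation described above can be sketched in a few lines of NumPy; the channel counts and spatial sizes below are illustrative, not the patent's exact ResNet50 dimensions:

```python
import numpy as np

def fuse_with_previous_layer(current, previous):
    """Before the next convolution, stack (concatenate) the current
    layer's feature map with the previous layer's map along the channel
    axis, so shallow-layer information is carried forward. Both maps
    must share the same spatial size (H, W)."""
    assert current.shape[1:] == previous.shape[1:], "spatial sizes must match"
    return np.concatenate([current, previous], axis=0)  # channels stacked

current = np.zeros((256, 75, 75), dtype=np.float32)   # current layer, C=256
previous = np.zeros((512, 75, 75), dtype=np.float32)  # previous layer, C=512
fused = fuse_with_previous_layer(current, previous)
print(fused.shape)  # (768, 75, 75)
```

A real network would follow this concatenation with a convolution to mix the stacked channels; only the stacking step is shown here.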
Optionally, after the pedestrian detection result is obtained, the method further includes:
and determining the number of pedestrians appearing in the target picture to be recognized according to the pedestrian detection result.
Optionally, the performing global average pooling and global maximum pooling on the multi-channel feature map respectively to obtain a feature map aggregating important channel feature information in the multi-channel feature map includes:
determining important channel feature information in the multi-channel feature map based on a global maximum pooling formula, and aggregating the important channel feature information based on a global average pooling formula to obtain a feature map aggregating the important channel feature information in the multi-channel feature map.
Optionally, the loss function L_Focal-EIoU of the Focal-EIoU loss, calculated according to the positional relationship between the candidate frame and the real frame, is:

L_Focal-EIoU = IoU^γ · L_EIoU

L_EIoU = 1 − IoU + ρ²(b, b*)/c² + ρ²(w, w*)/C_w² + ρ²(h, h*)/C_h²

wherein IoU represents the intersection-over-union of the real frame and the candidate frame; ρ represents the distance between the corresponding parameters of the real frame and the candidate frame, b and b* denoting the centre points of the candidate frame and the real frame; c represents the straight-line distance between the upper-left corner and the lower-right corner of the region enclosing the union of the real frame and the candidate frame; w and h represent the widths and heights of the real frame and the candidate frame; C_w and C_h represent the width and height of that enclosing region, so that the last two terms represent the width loss and the height loss between the real frame and the predicted frame; and γ represents a hyper-parameter.
Optionally, the loss function of the Smooth L1 loss, calculated according to the coordinates of the candidate frame and the real frame, is:

Smooth_L1(x) = 0.5·x², if |x| < 1; |x| − 0.5, otherwise

t = (x1, y1, x2, y2)

x = t − t*

wherein x1 and x2 represent the abscissae and y1 and y2 the ordinates of a frame's corner points, t represents the coordinates of the candidate frame, and t* represents the coordinates of the real frame.
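The Smooth L1 loss above, applied per coordinate to the difference between a candidate frame t and a real frame t*, can be written directly (the coordinate values are illustrative):

```python
def smooth_l1(x):
    """Element-wise Smooth L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5

# x = t - t*: per-coordinate differences between a candidate frame t
# and the real frame t*
t      = (10.0, 10.0, 50.0, 52.0)   # candidate frame (x1, y1, x2, y2)
t_star = (10.5, 10.0, 50.0, 50.0)   # real frame
loss = sum(smooth_l1(a - b) for a, b in zip(t, t_star))
print(loss)  # 0.5*0.5^2 + 0 + 0 + (2 - 0.5) = 1.625
```

The quadratic branch near zero keeps gradients small for near-correct coordinates, while the linear branch prevents large coordinate errors from dominating the loss.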
The present invention also provides a pedestrian detection device, including:
the acquisition module is used for acquiring a target picture to be identified;
the processing module is used for inputting the target picture to be recognized into a pedestrian detection model to obtain a pedestrian detection result; the pedestrian detection result is a picture containing a plurality of pedestrian detection frames;
the pedestrian detection model is composed of a convolutional neural network, an attention mechanism network and a region candidate network, and is trained based on the following steps:
acquiring a sample picture, inputting the sample picture into a convolutional neural network for multi-scale feature fusion to obtain a multi-channel feature map fusing feature information of each layer of picture of the convolutional neural network; the sample picture comprises a pre-marked real pedestrian frame;
inputting the multi-channel feature map into an attention mechanism network, and respectively carrying out global average pooling and global maximum pooling on the multi-channel feature map to obtain a feature map aggregating important channel feature information in the multi-channel feature map;
inputting the processed feature map into the region candidate network to obtain a picture containing a plurality of pedestrian candidate frames; calculating the Focal-EIoU loss according to the positional relationship between the candidate frames and the real frame, and calculating the Smooth L1 loss according to the coordinates of the candidate frames and the real frame, so as to take the sum of the Focal-EIoU loss and the Smooth L1 loss as the target loss; optimizing the pedestrian detection model according to the target loss, and finishing training when the target loss is less than a preset threshold value, thereby obtaining the trained pedestrian detection model.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the pedestrian detection method.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the pedestrian detection method as described in any one of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, carries out the steps of the pedestrian detection method as described in any one of the above.
According to the pedestrian detection method, the device, the electronic equipment and the storage medium, the acquired target picture to be recognized is input into the trained pedestrian detection model, and a pedestrian detection result is obtained; the pedestrian detection result is a picture containing a plurality of pedestrian detection frames. The pedestrian detection model is composed of a convolutional neural network, an attention mechanism network and a region candidate network, and is trained based on the following steps, and the pedestrian detection model comprises the following steps: step 1, obtaining a sample picture, inputting the sample picture into a convolutional neural network for multi-scale feature fusion, and obtaining a multi-channel feature map fusing feature information of each layer of picture of the convolutional neural network; the sample picture comprises a pre-marked real pedestrian frame; step 2, inputting the multi-channel feature map into an attention mechanism network, and respectively carrying out global average pooling and global maximum pooling on the multi-channel feature map to obtain a feature map aggregating important channel feature information in the multi-channel feature map; and 3, inputting the feature map processed in the step 2 into a regional candidate network to obtain a picture containing a plurality of pedestrian candidate frames, calculating the Focal-EIOU-Loss according to the position relation between the candidate frames and the real frame, calculating the Smooth L1Loss according to the coordinates of the candidate frames and the real frame so as to take the sum of the Focal-EIOU-Loss and the Smooth L1Loss as a target Loss, optimizing the pedestrian detection model according to the target Loss, and finishing training when the target Loss is less than a preset threshold value so as to obtain the trained pedestrian detection model. 
Therefore, in the training process of the pedestrian detection model, the detection precision for small targets is enhanced through multi-scale feature fusion; an attention mechanism combining global average pooling and global maximum pooling highlights the important information in the feature map, which helps to handle background clutter, increasing numbers of pedestrians, overlap between people and foreground confusion under complex backgrounds; and the combination of the Focal-EIoU loss and the Smooth L1 loss overcomes the problem that the traditional IoU algorithm cannot avoid being misled by background information.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a pedestrian detection method provided by the present invention;
FIG. 2 is a schematic flow chart of pedestrian detection model training provided by the present invention;
FIG. 3 is a second schematic flow chart of the pedestrian detection model training provided by the present invention;
FIG. 4 is a third schematic flow chart of pedestrian detection model training provided by the present invention;
FIG. 5 is a schematic structural diagram of a pedestrian detection device provided by the present invention;
fig. 6 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a pedestrian detection method provided by the invention, fig. 2 is a schematic flow chart of a pedestrian detection model training provided by the invention, fig. 3 is a second schematic flow chart of a pedestrian detection model training provided by the invention, and fig. 4 is a third schematic flow chart of a pedestrian detection model training provided by the invention. The pedestrian detection method provided by the present invention is explained and explained in detail below with reference to fig. 1 to 4.
As shown in fig. 1, the pedestrian detection method provided by the present invention includes:
step 101: acquiring a target picture to be identified;
in this step, the target picture to be recognized includes a plurality of pedestrians to be detected.
Step 102: inputting the target picture to be recognized into a pedestrian detection model to obtain a pedestrian detection result; the pedestrian detection result is a picture containing a plurality of pedestrian detection frames;
the pedestrian detection model is composed of a convolutional neural network, an attention mechanism network and a region candidate network, and is trained based on the following steps:
step 1, obtaining a sample picture, inputting the sample picture into a convolutional neural network for multi-scale feature fusion, and obtaining a multi-channel feature map fusing feature information of each layer of picture of the convolutional neural network; the sample picture comprises a pre-marked real pedestrian frame;
step 2, inputting the multi-channel feature map into an attention mechanism network, and respectively carrying out global average pooling and global maximum pooling on the multi-channel feature map to obtain a feature map aggregating important channel feature information in the multi-channel feature map;
step 3, inputting the feature map processed in step 2 into the region candidate network to obtain a picture containing a plurality of pedestrian candidate frames; calculating the Focal-EIoU loss according to the positional relationship between the candidate frames and the real frame, and calculating the Smooth L1 loss according to the coordinates of the candidate frames and the real frame, so as to take the sum of the Focal-EIoU loss and the Smooth L1 loss as the target loss; optimizing the pedestrian detection model according to the target loss, and finishing training when the target loss is less than a preset threshold value, thereby obtaining the trained pedestrian detection model.
In this step, it should be noted that the existing Faster R-CNN network, compared with one-stage networks such as YOLOv3, YOLOv4 and SSD, is not ideally suited to detecting a single object class such as pedestrians. The feature extraction of the Faster R-CNN backbone network benefits object classification and the recognition of large targets, but the shallow layers are not well utilized, which hinders the recognition of small targets. Therefore, when training the pedestrian detection model, multi-scale feature fusion is first performed on the sample picture based on the convolutional neural network ResNet50. Specifically, as shown in fig. 2, ResNet50 is used as the backbone feature network in the present invention because its parameter quantity is small, and the structure of ResNet50 can reduce overfitting during training. The deep semantic information finally output by the ResNet50 network helps object classification; however, to improve the accuracy of pedestrian recognition during pedestrian detection and counting, the feature extraction of the shallow layers must be strengthened to enhance the recognition of small targets, while high-level semantic information is still needed for classification and detection. As the network deepens, some information is lost at every downsampling step. Therefore, before each convolution operation, the picture feature information in the feature map output by the current layer is stacked and feature-fused with the picture feature information in the feature map output by the previous layer, and the convolution is then performed, so that the information of the previous layer is preserved and the information loss of the shallow network is reduced.
For example, in fig. 3, the output feature map is up-sampled and convolved, then stacked and convolved with the information in the (75, 75, 512) feature map of the previous layer. The fused information is up-sampled again, stacked with the (150, 150, 256) feature map and convolved, performing top-down feature fusion of the top-level information across the fusion channel numbers 256, 512 and 1024. After down-sampling, the result is stacked and convolved with the (75, 75, 512) feature-fused map. The feature map with 512 channels is then down-sampled further and stacked with the original shared feature layer, and finally batch normalization and a ReLU activation function are applied.
In this step, it should be noted that in the training process of existing pedestrian detection models, the network by default treats all information in the feature map as equally important, while in reality it is not. Faced with background clutter, increasing numbers of pedestrians, overlap between people, and foreground and background confusion under complex backgrounds, treating all channels of the extracted feature map with equal importance reduces the detection accuracy. As shown in fig. 4, to solve this problem, the present invention inputs the obtained multi-channel feature map into the attention mechanism network and performs global average pooling and global maximum pooling on it to obtain a feature map aggregating the important channel feature information. It can be understood that because the existing SE-Net attention mechanism considers only global average pooling, it can identify the class of a specific object effectively but does not highlight specific effective information; therefore, global maximum pooling is added on top of the global average pooling used in the prior art. In this way, attention is given to the important information in the feature map, and stacking the two pooling methods handles dense and occluded pedestrians more effectively.
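The channel-attention idea described above can be sketched as follows for a (C, H, W) feature map. Note that in SE-Net/CBAM-style modules the two pooled vectors pass through shared fully-connected layers; here they are simply summed and squashed with a sigmoid to keep the sketch self-contained:

```python
import numpy as np

def channel_attention(feat):
    """Reweight channels using both global average pooling and global
    max pooling. The learned fully-connected layers of a real attention
    module are omitted: the pooled vectors are summed and passed through
    a sigmoid to produce per-channel weights."""
    avg = feat.mean(axis=(1, 2))                  # global average pooling, (C,)
    mx = feat.max(axis=(1, 2))                    # global max pooling, (C,)
    weights = 1.0 / (1.0 + np.exp(-(avg + mx)))   # sigmoid channel weights
    return feat * weights[:, None, None]          # rescale each channel

feat = np.random.default_rng(0).normal(size=(8, 16, 16))
out = channel_attention(feat)
print(out.shape)  # (8, 16, 16)
```

Channels whose responses are strong under either pooling receive weights near 1 and are preserved; weakly responding channels are suppressed.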
In this step, the feature map processed in step 2 is input into the region candidate network to obtain a picture containing a plurality of pedestrian candidate frames. It should be noted that the sample picture contains pedestrian real frames labeled in advance, so the current picture contains both a plurality of pedestrian candidate frames and a plurality of pedestrian real frames. The traditional IoU algorithm distinguishes the real frame and a candidate frame intuitively by their intersection-over-union, and the candidate frame with a relatively large IoU coefficient is selected as the detection frame. In practical applications, however, background information can mislead the machine's judgment of the image: because of background information, a detection frame may obtain a larger coefficient than a better candidate frame, so that the candidate frame closer to the real frame is removed during non-maximum suppression. In addition, during training, candidate frames that do not overlap the real frame at all can still yield the same IoU loss. Although the IoU algorithm intuitively expresses the overlap of anchor boxes, these problems make the prediction inaccurate and affect how the network is adjusted.
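The flaw of the plain IoU loss noted above can be seen in a few lines: two candidate frames that miss the real frame by very different distances receive identical losses, so the loss gives no signal to prefer the nearer one (frame coordinates are illustrative):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) frames."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

real = (0, 0, 2, 2)
near_miss = (3, 0, 5, 2)    # close to the real frame, but no overlap
far_miss = (50, 0, 52, 2)   # far from the real frame, also no overlap
# both candidates receive the identical IoU loss of 1.0
print(1 - iou(real, near_miss), 1 - iou(real, far_miss))  # 1.0 1.0
```

Adding the centre-distance and width/height terms, as the EIoU-based loss does, restores a useful training signal in exactly this situation.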
In order to solve these problems, the invention adopts the EIoU algorithm to calculate the Focal-EIoU loss according to the positional relationship between the candidate frame and the real frame. Specifically, a normalized centre-point distance is added on the basis of the IoU algorithm, so that the distance between the candidate frame and the real frame is better expressed, and width-height loss, overlap-area loss and centre-distance loss between the candidate frame and the real frame are added, making the detected candidate frame physically closer to the real frame. Because a single loss function must account for the fact that Faster R-CNN screens out a large number of negative samples in the candidate-frame stage and fixes the ratio of positive to negative samples in the regression and classification stage, the Focal-EIoU loss alone contributes little to solving the imbalance between positive and negative samples. The invention therefore also calculates the Smooth L1 loss according to the coordinates of the candidate frame and the real frame, takes the sum of the Focal-EIoU loss and the Smooth L1 loss as the target loss, optimizes the pedestrian detection model according to the target loss, and finishes training when the target loss is less than a preset threshold value, thereby obtaining the trained pedestrian detection model. The pedestrian detection model is trained in the above manner, with IoU used as the standard during non-maximum suppression and AP precision calculation, so the prediction result of the detection frame is more accurate.
The pedestrian detection method provided by the invention obtains a pedestrian detection result by inputting the acquired target picture to be identified into a trained pedestrian detection model; the pedestrian detection result is a picture containing a plurality of pedestrian detection frames. The pedestrian detection model is composed of a convolutional neural network, an attention mechanism network and a region candidate network, and is trained based on the following steps, and the pedestrian detection model comprises the following steps: step 1, obtaining a sample picture, inputting the sample picture into a convolutional neural network for multi-scale feature fusion, and obtaining a multi-channel feature map fusing feature information of each layer of picture of the convolutional neural network; the sample picture comprises a pre-marked real pedestrian frame; step 2, inputting the multi-channel feature map into an attention mechanism network, and respectively carrying out global average pooling and global maximum pooling on the multi-channel feature map to obtain a feature map aggregating important channel feature information in the multi-channel feature map; and 3, inputting the feature map processed in the step 2 into a regional candidate network to obtain a picture containing a plurality of pedestrian candidate frames, calculating the Focal-EIOU-Loss according to the position relation between the candidate frames and the real frame, calculating the Smooth L1Loss according to the coordinates of the candidate frames and the real frame so as to take the sum of the Focal-EIOU-Loss and the Smooth L1Loss as a target Loss, optimizing the pedestrian detection model according to the target Loss, and finishing training when the target Loss is less than a preset threshold value so as to obtain the trained pedestrian detection model. 
Therefore, in the training process of the pedestrian detection model, the detection precision for small targets is enhanced through multi-scale feature fusion; an attention mechanism combining global average pooling and global maximum pooling highlights the important information in the feature map, which helps to handle background clutter, increasing numbers of pedestrians, overlap between people and foreground confusion under complex backgrounds; and the combination of the Focal-EIoU loss and the Smooth L1 loss overcomes the problem that the traditional IoU algorithm cannot avoid being misled by background information.
Based on the content of the foregoing embodiment, in this embodiment, inputting the sample picture into a convolutional neural network for multi-scale feature fusion, so as to obtain a multi-channel feature map fusing feature information of pictures of each layer of the convolutional neural network, where the method includes:
inputting the sample picture into a convolutional neural network for feature extraction, and before the convolutional neural network performs convolutional operation each time, stacking and feature fusing picture feature information in a feature map output by the current layer of the convolutional neural network and picture feature information in a feature map output by the previous layer to serve as a fused feature map of the current layer of the convolutional neural network, so as to finally obtain a multi-channel feature map fused with the picture feature information of each layer of the convolutional neural network.
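The stack-and-fuse step described above can be sketched as follows. This is an illustrative reconstruction, not the patent's exact network: the function name, the assumption that both feature maps share a spatial size, and the choice of a 1x1 convolution as the fusion step are all assumptions of this sketch.

```python
import numpy as np

def fuse_feature_maps(prev_feat, curr_feat, weights):
    """Stack the previous layer's feature map with the current layer's
    along the channel axis, then mix the stacked channels back down with
    a 1x1 convolution (here a per-pixel linear map given by `weights`).
    prev_feat: (C1, H, W), curr_feat: (C2, H, W), weights: (C2, C1 + C2)."""
    stacked = np.concatenate([prev_feat, curr_feat], axis=0)  # (C1 + C2, H, W)
    # a 1x1 convolution is exactly a matrix multiply over the channel dim
    return np.einsum("oc,chw->ohw", weights, stacked)
```

Running the fused map through the next convolution stage then carries both layers' picture feature information forward, which is how the multi-channel feature map accumulates information from every layer.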
Based on the content of the foregoing embodiment, in this embodiment, after obtaining the pedestrian detection result, the method further includes:
and determining the number of pedestrians appearing in the target picture to be recognized according to the pedestrian detection result.
In this embodiment, it can be understood that the number of pedestrians appearing in the target picture to be recognized can be determined according to the detection frame in the output pedestrian detection result picture, so as to realize pedestrian detection counting.
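The counting step reduces to counting the detection frames in the output picture; a minimal sketch, assuming (illustratively) that each detection is a (x1, y1, x2, y2, score) tuple and that a confidence threshold filters weak frames:

```python
def count_pedestrians(detections, score_threshold=0.5):
    """Count the detection frames whose confidence clears the threshold.
    `detections` is assumed to be a list of (x1, y1, x2, y2, score)
    tuples, one per predicted pedestrian frame (the tuple format is an
    assumption of this sketch, not specified by the patent)."""
    return sum(1 for *_box, score in detections if score >= score_threshold)
```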
Based on the content of the foregoing embodiment, in this embodiment, the performing global average pooling and global maximum pooling on the multi-channel feature map to obtain a feature map aggregating important channel feature information in the multi-channel feature map includes:
determining important channel feature information in the multi-channel feature map based on a global maximum pooling formula, and aggregating the important channel feature information based on a global average pooling formula to obtain a feature map aggregating the important channel feature information in the multi-channel feature map;
wherein the global maximum pooling formula is:
F2=GlobalMax(F)
sc=σ(W4·δ(W3·F2))
FB=Fscale(uc,sc)=sc×uc
wherein F2 is the shared feature layer produced by global maximum pooling, W3 and W4 are weight coefficients, σ is the Sigmoid activation function, δ is the ReLU activation, u is the shared feature layer, c is the channel index, and the number of fully connected neurons is C/r, obtained by dividing the channel count C by a reduction ratio with integer division.
The global average pooling formula is as follows:
F1=GlobalAvg(F)
sc=σ(W2·δ(W1·F1))
FA=Fscale(uc,sc)=sc·uc
wherein F1 is the shared feature layer produced by global average pooling, W1 and W2 are weight coefficients, σ is the Sigmoid activation function, δ is the ReLU activation, u is the shared feature layer, c is the channel index, and the number of fully connected neurons is C/r, obtained by dividing the channel count C by a reduction ratio with integer division.
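A minimal numpy sketch of the two-branch channel attention described by the formulas above, assuming (as in SE-Net) a shared two-layer bottleneck with a ReLU between the weight layers; the weight shapes, and the decision to sum the two branch excitations before the sigmoid, are assumptions of this sketch rather than details taken from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """feat: (C, H, W); w1: (C // r, C); w2: (C, C // r).
    Global-average and global-maximum descriptors pass through a shared
    bottleneck; the summed excitations give per-channel weights s_c, and
    each channel u_c is rescaled as F_scale(u_c, s_c) = s_c * u_c."""
    avg = feat.mean(axis=(1, 2))   # global average pooling -> (C,)
    mx = feat.max(axis=(1, 2))     # global maximum pooling -> (C,)
    # shared MLP bottleneck: C -> C/r -> C, ReLU in between
    bottleneck = lambda v: w2 @ np.maximum(w1 @ v, 0.0)
    s = sigmoid(bottleneck(avg) + bottleneck(mx))  # s_c per channel
    return feat * s[:, None, None]                 # s_c * u_c
```

With all-zero bottleneck weights every excitation is sigmoid(0) = 0.5, so the output is simply half the input — a handy sanity check.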
Based on the content of the above embodiment, in this embodiment, the Loss function LFocal-EIoU of the Focal-EIOU-Loss, calculated according to the positional relationship between the candidate frame and the real frame, is:
wherein IoU represents the intersection-over-union of the real frame and the candidate frame, ρ represents the distance between the parameters of the real frame and the candidate frame, c represents the straight-line distance between the upper-left corner and the lower-right corner of the region enclosed by the union of the real frame and the candidate frame, w represents the widths of the real frame and the candidate frame, Cw represents the width loss of the real frame and the candidate frame, h represents the heights of the real frame and the candidate frame, Ch represents the height loss of the real frame and the candidate frame, and γ represents a hyper-parameter.
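The equation itself is not reproduced in the text above; the sketch below implements the published Focal-EIoU formulation (EIoU loss with center-distance, width and height terms, weighted by IoU raised to the power γ), which matches the variables listed but is not guaranteed to be the patent's exact equation.

```python
def focal_eiou_loss(pred, target, gamma=0.5, eps=1e-7):
    """Focal-EIoU for a single pair of boxes (x1, y1, x2, y2).
    gamma follows the published default; the experiments in this
    document report using gamma = 0.4."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    # intersection-over-union of candidate and real frames
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)
    # smallest enclosing box: width C_w, height C_h, squared diagonal c^2
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps
    # squared center-point distance
    d2 = ((px1 + px2 - tx1 - tx2) / 2) ** 2 + ((py1 + py2 - ty1 - ty2) / 2) ** 2
    # width and height loss terms
    dw2 = ((px2 - px1) - (tx2 - tx1)) ** 2
    dh2 = ((py2 - py1) - (ty2 - ty1)) ** 2
    eiou = 1.0 - iou + d2 / c2 + dw2 / (cw ** 2 + eps) + dh2 / (ch ** 2 + eps)
    return (iou ** gamma) * eiou  # focal weighting by IoU**gamma
```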
Based on the content of the above embodiment, in the present embodiment, the loss function for calculating the Smooth L1 Loss from the coordinates of the candidate frame and the real frame is:
t=(x1,y1,x2,y2)
x=t-t*
wherein x1, x2 represent the abscissas and y1, y2 the ordinates of a frame's corners, t represents the candidate frame coordinates, and t* represents the real frame coordinates.
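The piecewise Smooth L1 definition is likewise not shown in the text; this sketch uses the standard form (quadratic below a threshold β, linear above), summed over the four coordinates of t = (x1, y1, x2, y2) with residual x = t − t*.

```python
def smooth_l1_loss(candidate, real, beta=1.0):
    """Standard Smooth L1 over box coordinates: for each residual
    x = t - t*, use 0.5 * x**2 / beta when |x| < beta, else |x| - 0.5 * beta."""
    total = 0.0
    for t, t_star in zip(candidate, real):
        x = abs(t - t_star)
        total += 0.5 * x * x / beta if x < beta else x - 0.5 * beta
    return total
```

The quadratic region keeps gradients small near zero, while the linear region stops large coordinate errors from dominating the target loss when summed with the Focal-EIOU term.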
The following is illustrated by specific examples:
the first embodiment is as follows:
In this embodiment, as shown in fig. 2, an input picture is first resized and then fed into an improved ResNet50 network for feature extraction to obtain a shared feature layer (Feature Map). The shared feature layer enters an improved attention mechanism network, which concentrates along the channel direction and assigns large weights to important information after feature extraction, so that useful information in the feature map receives more attention. After the attention mechanism processing, the feature map enters the region candidate network RPN to obtain recommended candidate regions (candidate frames). After the region proposal network, non-maximum suppression screening is performed, with the loss formed by combining the Smooth L1 Loss and the EIOU Loss replacing the IOU Loss, to obtain the candidate regions; finally, ROI Pooling produces fixed-size features for classifying and identifying the object.
Therefore, on the one hand, RPN prediction followed by candidate region recommendation is carried out based on the Faster-RCNN algorithm, which improves detection precision. Pedestrian detection of a single object class involves both large and small targets; the semantic information extracted by the original algorithm's features is very deep, which benefits object classification but hinders recognition of small targets. The invention therefore applies multi-scale feature fusion to the ResNet50-based Faster-RCNN, enhancing the detection precision of small targets without neglecting the recognition of large ones. On another hand, to address background confusion, an increased number of pedestrians, overlap between people and foreground confusion in complex backgrounds, the SE-Net attention mechanism is applied; however, SE-Net aggregates feature information with global average pooling alone and does not emphasize the important channels in the feature map, so global maximum pooling is added, and the two kinds of information are summed to highlight the channels of high importance. On the other hand, aiming at the shortcomings of the existing IOU algorithm, the invention follows the Focal-EIOU algorithm and adds a normalized center-point distance on the basis of the IOU, which better expresses the distance between frames, and adds width and height loss, overlap-area loss and center-distance loss between the detection frame and the real frame, bringing the physical description of the two closer.
Specifically, Focal-Loss is suited to serving as the loss function of a one-stage algorithm. Because Faster-RCNN already screens out a large number of negative samples in the candidate-frame stage and fixes the ratio of positive to negative samples in the regression and classification stages, using Focal-Loss to address the imbalance between positive and negative samples would bring little benefit; the invention therefore adopts the combination of the Focal-EIOU-Loss and the Smooth L1 Loss to address the shortcomings of the IOU algorithm.
Optionally, the method is run on a Windows 10 system, and the network is trained in the manner described above using a Y7000P machine and a 2080 Ti graphics card. For pedestrian detection counting, 1052 annotated color and black-and-white pictures are selected from the data set; the sizes, postures and numbers of the people, the blurriness of the pedestrians, the illumination and the background environments all differ, making the set suitable as a pedestrian detection training set. The learning rate is 0.0001 for the first 50 training epochs and 0.00001 for the last 50 epochs. The data set uses the VOC format, with 946 pictures for training and 106 for validation in each epoch; after each epoch, Total-Loss and Val-Loss are obtained to check the fitting ability on unseen data, so as to find the model with the best effect.
The invention considers a multi-scale training mode to improve the robustness of the trained model: the training pictures are scaled in turn to (400, 400), (600, 600) and (800, 800), and a random-scale strategy is compared against them by mAP. The experimental results are shown in Table 1 below:
TABLE 1
After experimental comparison, training with the fixed size (600, 600) outperforms the random-scale method, with an mAP precision 0.04% higher. The invention therefore uniformly trains all compared tests at size (600, 600). The comparative results are shown in Table 2:
TABLE 2
TABLE 3
TABLE 4
As can be seen, adding the ISE attention mechanism provided by the invention behind the Faster-RCNN backbone network yields an mAP precision of 75.02%, an improvement of 2.76%. Dividing positive and negative samples with the EIOU loss improves precision by 0.75%, and the multi-scale feature fusion method improves detection precision by 0.66%. Each method provided by the invention thus brings a corresponding improvement to the detection precision of Faster-RCNN. Finally, with all the methods provided by the invention combined, the detection result is 75.64%, an improvement of 3.38%.
To further compare the effectiveness of the improved attention mechanism introduced herein, comparative experiments combining the present algorithm with various attention mechanisms were performed; the results are shown in Table 3. To demonstrate the effect of the EIOU-SmoothL1 algorithm presented herein, it was compared with the IOU, GIOU and CIOU algorithms combined with the experimental model, taking γ = 0.4; the experimental results are shown in Table 4.
The pedestrian detection model provided by the invention not only enhances the detection capability for small targets but also does not neglect the recognition and classification of large targets, so the detection model used in the experiment can also serve as a general framework for target detection: an annotated data set in VOC format can be sent into the network for training, and the trained weights complete the detection counting function. To verify its validity, the VOC2007 data set, which contains 10000 pictures in 20 classes, was used for training. The improved Faster-RCNN algorithm based on ResNet101, the ResNet50-FPN algorithm and the ResNet101-FPN algorithm were compared with many other classes of algorithms on mAP accuracy after training on the VOC2007 data set; the comparison is shown in Table 5. To ensure the fairness of the experiment, YOLOv4 performs its loss calculation with its own CIOU algorithm, while the remaining algorithms all use the IOU loss for dividing positive and negative samples.
TABLE 5
As can be seen from the above table, ResNet101-FPN is more complex than the other networks and improves accuracy the most on the same training data. The mAP accuracy of the pedestrian detection model provided by the invention is 75.37%, a corresponding improvement over the other algorithms.
In the following description of the pedestrian detection device provided by the present invention, the pedestrian detection device described below and the pedestrian detection method described above may be referred to in correspondence with each other.
As shown in fig. 5, the present invention provides a pedestrian detection device, including:
the acquisition module 1 is used for acquiring a target picture to be identified;
the processing module 2 is used for inputting the target picture to be recognized into a pedestrian detection model to obtain a pedestrian detection result; the pedestrian detection result is a picture containing a plurality of pedestrian detection frames;
the pedestrian detection model is composed of a convolutional neural network, an attention mechanism network and a region candidate network, and is trained based on the following steps, and the pedestrian detection model comprises the following steps:
acquiring a sample picture, inputting the sample picture into a convolutional neural network for multi-scale feature fusion to obtain a multi-channel feature map fusing feature information of each layer of picture of the convolutional neural network; the sample picture comprises a pre-marked real pedestrian frame;
inputting the multi-channel feature map into an attention mechanism network, and performing global average pooling and global maximum pooling on the multi-channel feature map to obtain a feature map aggregating important channel feature information in the multi-channel feature map;
inputting the processed characteristic diagram into a regional candidate network to obtain a picture containing a plurality of pedestrian candidate frames, calculating the Focal-EIOU-Loss according to the position relation between the candidate frames and the real frame, calculating the Smooth L1Loss according to the coordinates of the candidate frames and the real frame so as to take the sum of the Focal-EIOU-Loss and the Smooth L1Loss as a target Loss, optimizing the pedestrian detection model according to the target Loss, and finishing training when the target Loss is less than a preset threshold value, thereby obtaining a trained pedestrian detection model.
The pedestrian detection device provided by the invention obtains a pedestrian detection result by inputting the acquired target picture to be identified into the trained pedestrian detection model; the pedestrian detection result is a picture containing a plurality of pedestrian detection frames. The pedestrian detection model is composed of a convolutional neural network, an attention mechanism network and a region candidate network, and is trained based on the following steps, and the pedestrian detection model comprises the following steps: acquiring a sample picture, inputting the sample picture into a convolutional neural network for multi-scale feature fusion to obtain a multi-channel feature map fusing feature information of each layer of picture of the convolutional neural network; the sample picture comprises a pre-marked real pedestrian frame; inputting the multi-channel feature map into an attention mechanism network, and respectively carrying out global average pooling and global maximum pooling on the multi-channel feature map to obtain a feature map aggregating important channel feature information in the multi-channel feature map; inputting the processed characteristic diagram into a regional candidate network to obtain a picture containing a plurality of pedestrian candidate frames, calculating the Focal-EIOU-Loss according to the position relation between the candidate frames and the real frame, calculating the Smooth L1Loss according to the coordinates of the candidate frames and the real frame so as to take the sum of the Focal-EIOU-Loss and the Smooth L1Loss as a target Loss, optimizing the pedestrian detection model according to the target Loss, and finishing training when the target Loss is less than a preset threshold value, thereby obtaining a trained pedestrian detection model. 
Therefore, in the training process of the pedestrian detection model, the detection precision of the small object target is enhanced in a multi-scale feature fusion mode, the important information in the feature map is highlighted by adopting an attention mechanism combining global average pooling and global maximum pooling, so that the problems of background confusion, increased pedestrian number, overlapping between people and foreground confusion under a complex background are solved, and the problem that misleading caused by background information cannot be solved by the traditional IOU algorithm in a mode combining Focal-EIOU-Loss and Smooth L1Loss is solved.
Fig. 6 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 6: a processor (processor)610, a communication Interface (Communications Interface)620, a memory (memory)630 and a communication bus 640, wherein the processor 610, the communication Interface 620 and the memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a pedestrian detection method comprising: acquiring a target picture to be identified; inputting the target picture to be recognized into a pedestrian detection model to obtain a pedestrian detection result; the pedestrian detection result is a picture containing a plurality of pedestrian detection frames; the pedestrian detection model is composed of a convolutional neural network, an attention mechanism network and a region candidate network, and is trained based on the following steps, and the pedestrian detection model comprises the following steps: step 1, obtaining a sample picture, inputting the sample picture into a convolutional neural network for multi-scale feature fusion, and obtaining a multi-channel feature map fusing feature information of each layer of picture of the convolutional neural network; the sample picture comprises a pre-marked real pedestrian frame; step 2, inputting the multi-channel feature map into an attention mechanism network, and performing global average pooling and global maximum pooling on the multi-channel feature map to obtain a feature map aggregating important channel feature information in the multi-channel feature map; and 3, inputting the feature map processed in the step 2 into a regional candidate network to obtain a picture containing a plurality of pedestrian candidate frames, calculating the Focal-EIOU-Loss according to the position relation between the candidate frames and the real frame, calculating the Smooth L1Loss according to the coordinates of the candidate 
frames and the real frame so as to take the sum of the Focal-EIOU-Loss and the Smooth L1Loss as a target Loss, optimizing the pedestrian detection model according to the target Loss, and finishing training when the target Loss is less than a preset threshold value so as to obtain the trained pedestrian detection model.
In addition, the logic instructions in the memory 630 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, the computer program, when executed by a processor, being capable of executing the pedestrian detection method provided by the above methods, the method comprising: acquiring a target picture to be identified; inputting the target picture to be recognized into a pedestrian detection model to obtain a pedestrian detection result; the pedestrian detection result is a picture containing a plurality of pedestrian detection frames; the pedestrian detection model is composed of a convolutional neural network, an attention mechanism network and a region candidate network, and is trained based on the following steps, and the pedestrian detection model comprises the following steps: step 1, obtaining a sample picture, inputting the sample picture into a convolutional neural network for multi-scale feature fusion, and obtaining a multi-channel feature map fusing feature information of each layer of picture of the convolutional neural network; the sample picture comprises a pre-marked real pedestrian frame; step 2, inputting the multi-channel feature map into an attention mechanism network, and performing global average pooling and global maximum pooling on the multi-channel feature map to obtain a feature map aggregating important channel feature information in the multi-channel feature map; and 3, inputting the feature map processed in the step 2 into a regional candidate network to obtain a picture containing a plurality of pedestrian candidate frames, calculating the Focal-EIOU-Loss according to the position relation between the candidate frames and the real frame, calculating the Smooth L1Loss according to the coordinates of the candidate frames and the real frame so as to take the sum of the Focal-EIOU-Loss and the Smooth L1Loss as a target 
Loss, optimizing the pedestrian detection model according to the target Loss, and finishing training when the target Loss is less than a preset threshold value so as to obtain the trained pedestrian detection model.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor, implements a pedestrian detection method provided by performing the above methods, the method including: acquiring a target picture to be identified; inputting the target picture to be recognized into a pedestrian detection model to obtain a pedestrian detection result; the pedestrian detection result is a picture containing a plurality of pedestrian detection frames; the pedestrian detection model is composed of a convolutional neural network, an attention mechanism network and a region candidate network, and is trained based on the following steps, and the pedestrian detection model comprises the following steps: step 1, obtaining a sample picture, inputting the sample picture into a convolutional neural network for multi-scale feature fusion, and obtaining a multi-channel feature map fusing feature information of each layer of picture of the convolutional neural network; the sample picture comprises a pre-marked real pedestrian frame; step 2, inputting the multi-channel feature map into an attention mechanism network, and performing global average pooling and global maximum pooling on the multi-channel feature map to obtain a feature map aggregating important channel feature information in the multi-channel feature map; and 3, inputting the feature map processed in the step 2 into a regional candidate network to obtain a picture containing a plurality of pedestrian candidate frames, calculating the Focal-EIOU-Loss according to the position relation between the candidate frames and the real frame, calculating the Smooth L1Loss according to the coordinates of the candidate frames and the real frame so as to take the sum of the Focal-EIOU-Loss and the Smooth L1Loss as a target Loss, optimizing the pedestrian detection model according to the target Loss, and finishing training when 
the target Loss is less than a preset threshold value so as to obtain the trained pedestrian detection model.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A pedestrian detection method, characterized by comprising:
acquiring a target picture to be identified;
inputting the target picture to be recognized into a pedestrian detection model to obtain a pedestrian detection result; the pedestrian detection result is a picture containing a plurality of pedestrian detection frames;
the pedestrian detection model is composed of a convolutional neural network, an attention mechanism network and a region candidate network, and is trained based on the following steps, and the pedestrian detection model comprises the following steps:
step 1, obtaining a sample picture, inputting the sample picture into a convolutional neural network for multi-scale feature fusion, and obtaining a multi-channel feature map fusing feature information of each layer of picture of the convolutional neural network; the sample picture comprises a pre-marked real pedestrian frame;
step 2, inputting the multi-channel feature map into an attention mechanism network, and performing global average pooling and global maximum pooling on the multi-channel feature map to obtain a feature map aggregating important channel feature information in the multi-channel feature map;
and 3, inputting the feature map processed in the step 2 into a regional candidate network to obtain a picture containing a plurality of pedestrian candidate frames, calculating the Focal-EIOU-Loss according to the position relation between the candidate frames and the real frame, calculating the Smooth L1Loss according to the coordinates of the candidate frames and the real frame so as to take the sum of the Focal-EIOU-Loss and the Smooth L1Loss as a target Loss, optimizing the pedestrian detection model according to the target Loss, and finishing training when the target Loss is less than a preset threshold value so as to obtain the trained pedestrian detection model.
2. The pedestrian detection method according to claim 1, wherein the step of inputting the sample picture into a convolutional neural network for multi-scale feature fusion to obtain a multi-channel feature map fusing feature information of pictures of each layer of the convolutional neural network comprises the following steps:
inputting the sample picture into a convolutional neural network for feature extraction, and before the convolutional neural network performs convolutional operation each time, stacking and feature fusing picture feature information in a feature map output by the current layer of the convolutional neural network and picture feature information in a feature map output by the previous layer to serve as a fused feature map of the current layer of the convolutional neural network, so as to finally obtain a multi-channel feature map fused with the picture feature information of each layer of the convolutional neural network.
3. The pedestrian detection method according to claim 1, further comprising, after said obtaining the pedestrian detection result:
and determining the number of pedestrians appearing in the target picture to be recognized according to the pedestrian detection result.
4. The pedestrian detection method according to claim 1, wherein the performing global average pooling and global maximum pooling on the multi-channel feature map to obtain a feature map aggregating important channel feature information in the multi-channel feature map comprises:
determining important channel feature information in the multi-channel feature map based on a global maximum pooling formula, and aggregating the important channel feature information based on a global average pooling formula to obtain a feature map aggregating the important channel feature information in the multi-channel feature map.
5. The pedestrian detection method according to claim 1, wherein the Loss function LFocal-EIoU of the Focal-EIOU-Loss, calculated according to the positional relationship between the candidate frame and the real frame, is:
wherein IoU represents the intersection-over-union of the real frame and the candidate frame, ρ represents the distance between the parameters of the real frame and the candidate frame, c represents the straight-line distance between the upper-left corner and the lower-right corner of the region enclosed by the union of the real frame and the candidate frame, w represents the widths of the real frame and the candidate frame, Cw represents the width loss of the real frame and the candidate frame, h represents the heights of the real frame and the candidate frame, Ch represents the height loss of the real frame and the candidate frame, and γ represents a hyper-parameter.
6. The pedestrian detection method according to claim 1, wherein the loss function for calculating the Smooth L1 Loss from the coordinates of the candidate frame and the real frame is:
t=(x1,y1,x2,y2)
x=t-t*
wherein x1, x2 represent the abscissas and y1, y2 the ordinates of a frame's corners, t represents the candidate frame coordinates, and t* represents the real frame coordinates.
7. A pedestrian detection device, characterized by comprising:
the acquisition module is used for acquiring a target picture to be identified;
the processing module is used for inputting the target picture to be recognized into a pedestrian detection model to obtain a pedestrian detection result; the pedestrian detection result is a picture containing a plurality of pedestrian detection frames;
the pedestrian detection model is composed of a convolutional neural network, an attention mechanism network and a region candidate network, and is trained based on the following steps:
acquiring a sample picture, inputting the sample picture into a convolutional neural network for multi-scale feature fusion to obtain a multi-channel feature map fusing feature information of each layer of picture of the convolutional neural network; the sample picture comprises a pre-marked real pedestrian frame;
inputting the multi-channel feature map into an attention mechanism network, and performing global average pooling and global maximum pooling on the multi-channel feature map to obtain a feature map aggregating important channel feature information in the multi-channel feature map;
inputting the processed feature map into a region candidate network to obtain a picture containing a plurality of pedestrian candidate frames, calculating the Focal-EIOU Loss according to the positional relationship between the candidate frames and the real frame, calculating the Smooth L1 Loss according to the coordinates of the candidate frames and the real frame, taking the sum of the Focal-EIOU Loss and the Smooth L1 Loss as the target loss, optimizing the pedestrian detection model according to the target loss, and finishing training when the target loss is less than a preset threshold value, thereby obtaining a trained pedestrian detection model.
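The training-termination logic of the step above can be sketched as follows. The threshold value, the `step_fn` callback, and the toy loss schedule are assumptions for illustration, not part of the claim.

```python
def target_loss(focal_eiou_loss, smooth_l1_loss):
    # the claim takes the sum of the two losses as the optimisation target
    return focal_eiou_loss + smooth_l1_loss

def train(step_fn, threshold=0.05, max_steps=1000):
    """Run optimisation steps until the target loss drops below the preset
    threshold; step_fn performs one update and returns the current
    (focal_eiou, smooth_l1) loss pair."""
    for step in range(max_steps):
        loss = target_loss(*step_fn())
        if loss < threshold:
            return step, loss   # training finished
    return max_steps, loss

# toy step function whose loss pair shrinks each call
losses = iter([(0.8, 0.4), (0.4, 0.2), (0.1, 0.05), (0.02, 0.01)])
result = train(lambda: next(losses), threshold=0.05)
print(result)
```

Training stops on the first step whose combined loss falls under the threshold, matching the "finishing training when the target loss is less than a preset threshold" condition.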
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the pedestrian detection method according to any one of claims 1 to 6 are implemented when the program is executed by the processor.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the pedestrian detection method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the pedestrian detection method according to any one of claims 1 to 6 when executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111129908.9A CN114067186B (en) | 2021-09-26 | 2021-09-26 | Pedestrian detection method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114067186A true CN114067186A (en) | 2022-02-18 |
CN114067186B CN114067186B (en) | 2024-04-16 |
Family
ID=80233729
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111129908.9A Active CN114067186B (en) | 2021-09-26 | 2021-09-26 | Pedestrian detection method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114067186B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114511636A (en) * | 2022-04-20 | 2022-05-17 | 科大天工智能装备技术(天津)有限公司 | Fruit counting method and system based on double-filtering attention module |
CN114897848A (en) * | 2022-05-20 | 2022-08-12 | 中国农业大学 | Fry counting method and device, electronic equipment and storage medium |
CN117830739A (en) * | 2024-01-04 | 2024-04-05 | 北京鉴智科技有限公司 | Target object identification method, system, electronic equipment and storage medium |
WO2024125583A1 (en) * | 2022-12-14 | 2024-06-20 | 中国电信股份有限公司 | Detection method and related device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019144575A1 (en) * | 2018-01-24 | 2019-08-01 | 中山大学 | Fast pedestrian detection method and device |
CN111488766A (en) * | 2019-01-28 | 2020-08-04 | 北京京东尚科信息技术有限公司 | Target detection method and device |
CN112200161A (en) * | 2020-12-03 | 2021-01-08 | 北京电信易通信息技术股份有限公司 | Face recognition detection method based on mixed attention mechanism |
CN112488999A (en) * | 2020-11-19 | 2021-03-12 | 特斯联科技集团有限公司 | Method, system, storage medium and terminal for detecting small target in image |
CN112733749A (en) * | 2021-01-14 | 2021-04-30 | 青岛科技大学 | Real-time pedestrian detection method integrating attention mechanism |
CN112949510A (en) * | 2021-03-08 | 2021-06-11 | 香港理工大学深圳研究院 | Human detection method based on fast R-CNN thermal infrared image |
Non-Patent Citations (2)
Title |
---|
Liu Bowen; Peng Zhuliang; Fan Cheng'an: "Pedestrian Detection Based on Cascade-Rcnn", Wireless Internet Technology, no. 02, 25 January 2020 (2020-01-25), pages 21 - 23 *
Zhan Weiqin; Ni Rongrong; Yang Biao: "PointPillars+ 3D Object Detection Based on Attention Mechanism", Journal of Jiangsu University (Natural Science Edition), no. 03, 31 December 2020 (2020-12-31), pages 25 - 30 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110363182B (en) | Deep learning-based lane line detection method | |
CN114067186B (en) | Pedestrian detection method and device, electronic equipment and storage medium | |
US9965719B2 (en) | Subcategory-aware convolutional neural networks for object detection | |
WO2017190574A1 (en) | Fast pedestrian detection method based on aggregation channel features | |
CN108960266A (en) | Image object detection method and device | |
CN107545263B (en) | Object detection method and device | |
CN111160249A (en) | Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion | |
CN112381775A (en) | Image tampering detection method, terminal device and storage medium | |
CN109241985A (en) | A kind of image-recognizing method and device | |
WO2023019875A1 (en) | Vehicle loss detection method and apparatus, and electronic device and storage medium | |
CN111797829A (en) | License plate detection method and device, electronic equipment and storage medium | |
CN108305260B (en) | Method, device and equipment for detecting angular points in image | |
CN111461260A (en) | Target detection method, device and equipment based on feature fusion and storage medium | |
CN115272652A (en) | Dense object image detection method based on multiple regression and adaptive focus loss | |
CN112597995B (en) | License plate detection model training method, device, equipment and medium | |
CN110852327A (en) | Image processing method, image processing device, electronic equipment and storage medium | |
CN113537211A (en) | Deep learning license plate frame positioning method based on asymmetric IOU | |
CN115223123A (en) | Road surface target detection method based on computer vision recognition | |
CN115100741A (en) | Point cloud pedestrian distance risk detection method, system, equipment and medium | |
CN111340139A (en) | Method and device for judging complexity of image content | |
CN114821194B (en) | Equipment running state identification method and device | |
CN111163332A (en) | Video pornography detection method, terminal and medium | |
CN110889418A (en) | Gas contour identification method | |
CN109977844A (en) | One kind giving precedence to pedestrian's control system and control method based on image recognition motor vehicle | |
CN106909936B (en) | Vehicle detection method based on double-vehicle deformable component model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||