CN108898047B - Pedestrian detection method and system based on block occlusion perception

Info

Publication number: CN108898047B (grant of application CN108898047A)
Application number: CN201810393658.1A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: pedestrian, frame, preset, anchor, anchor point
Legal status: Active (granted)
Inventors: 雷震 (Zhen Lei), 张士峰 (Shifeng Zhang), 庄楚斌 (Chubin Zhuang)
Assignee: Institute of Automation, Chinese Academy of Sciences
Application filed by the Institute of Automation, Chinese Academy of Sciences, with priority to CN201810393658.1A.


Classifications

    • G06V 40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06F 18/253: Fusion techniques of extracted features
    • G06F 18/29: Graphical models, e.g. Bayesian networks

Abstract

The invention belongs to the technical field of pattern recognition, and particularly relates to a pedestrian detection method and system based on block occlusion perception, aiming to solve the technical problem of low pedestrian detection accuracy caused by occluded pedestrians. To this end, the pedestrian detection method of the invention includes: acquiring, based on a pre-constructed pedestrian detection model and according to an image of the pedestrian to be detected, the image features corresponding to each preset human body detection region; performing feature fusion on the acquired image features to obtain the overall features of the corresponding pedestrians; acquiring a plurality of detection result frames of the image according to the overall features; and selecting, among the obtained detection result frames, those that meet preset screening conditions. Based on these steps, occluded pedestrians in the image to be detected can be detected effectively. The pedestrian detection system of the invention can likewise execute and realize this method.

Description

Pedestrian detection method and system based on block occlusion perception
Technical Field
The invention belongs to the technical field of pattern recognition, and particularly relates to a pedestrian detection method and system based on block occlusion perception.
Background
Pedestrian detection is a technology for automatically locating the position and size of pedestrians in an arbitrary input image. It is widely applied in computer vision and pattern recognition, for example in automatic driving, video surveillance, and biometric recognition.
In complex real-life environments, occlusion of pedestrians is one of the biggest challenges currently facing pedestrian detection; how to detect pedestrians efficiently and accurately, especially in crowded scenes, is a research hot spot and difficulty. To address this problem, most current pedestrian detection methods use a block-based model: a series of block detectors is learned, and the results of the detectors are combined for the final pedestrian localization. These methods, however, simply require each predicted detection window to be as close as possible to its pedestrian labeling box, without taking the inherent links between the windows into account. The performance of these pedestrian detectors is therefore very sensitive to the setting of the non-maximum suppression (NMS) threshold, which has a large impact on detector performance, especially in large-scale crowded scenes.
Disclosure of Invention
In order to solve the above-mentioned problems in the prior art, that is, to solve the technical problem of low pedestrian detection accuracy caused by pedestrian occlusion, an aspect of the present invention provides a pedestrian detection method based on block occlusion perception, including:
acquiring image characteristics corresponding to each preset human body detection area based on a pre-constructed pedestrian detection model and according to a to-be-detected pedestrian image;
performing feature fusion on the acquired image features to obtain the overall features of the corresponding pedestrians;
acquiring a plurality of detection result frames of the to-be-detected pedestrian image according to the overall characteristics;
selecting a detection result frame which meets a preset screening condition from the obtained detection result frames;
wherein the pedestrian detection model is a model constructed based on a Faster R-CNN neural network, and anchor boxes are associated with a high convolution layer of the Faster R-CNN neural network.
Further, before "based on a pre-constructed pedestrian detection model, and according to a to-be-detected pedestrian image, acquiring image features corresponding to each preset human body detection area", the method further includes:
performing data augmentation processing on a preset training image to obtain a training sample;
matching the anchor boxes with the pedestrian labeling boxes in the training sample, and dividing the anchor boxes into positive samples and negative samples according to the matching result, a positive sample being an anchor box matched with a pedestrian labeling box and a negative sample being an anchor box not matched with any pedestrian labeling box;
selecting a preset first number of negative samples by a hard negative mining method;
calculating a loss function value according to the positive samples and the selected negative samples, and updating the Faster R-CNN neural network according to the loss function value; network training is then performed again on the updated Faster R-CNN neural network until the updated Faster R-CNN neural network meets a preset convergence condition.
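The four training steps above can be summarized in the following sketch (PyTorch-style; `augment_sample`, `match_anchors`, `mine_hard_negatives`, and `compute_loss` are hypothetical helpers standing in for the operations just described, and the learning rate, momentum, iteration budget, and convergence test are assumed values, not taken from the patent):

```python
import torch

def train_detector(model, loader, max_iters=80000, lr=1e-3, tol=1e-4):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _, (image, gt_boxes) in zip(range(max_iters), loader):
        image, gt_boxes = augment_sample(image, gt_boxes)      # data augmentation
        pos, neg = match_anchors(model.anchors, gt_boxes)      # anchor matching
        neg = mine_hard_negatives(model, image, neg)           # hard negatives
        loss = compute_loss(model(image), pos, neg, gt_boxes)  # loss value
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                       # update network
        if loss.item() < tol:  # preset convergence condition (assumed form)
            break
    return model
```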
Further, the Faster R-CNN neural network comprises an RPN module; before "based on a pre-constructed pedestrian detection model, and according to a to-be-detected pedestrian image, acquiring image features corresponding to each preset human body detection area", the method further includes:
based on a preset training image, performing network training on the RPN module according to the loss function shown in the following formula:

$$\mathcal{L}_{RPN}(\{p_i\},\{t_i\}) = \mathcal{L}_{cls}(\{p_i,p_i^*\}) + \alpha_1\,\mathcal{L}_{agg}(\{p_i^*\},\{t_i\},\{t_i^*\})$$

where $\mathcal{L}_{cls}$ is the pedestrian classification loss function, $\mathcal{L}_{agg}$ is the aggregation loss function, $i$ denotes the anchor box index, $p_i$ and $t_i$ respectively denote the predicted probability that the $i$-th anchor box is a pedestrian and the corresponding predicted coordinates, $p_i^*$ and $t_i^*$ respectively denote the object class label associated with the $i$-th anchor box and the corresponding calibrated coordinates, and $\alpha_1$ is a first hyperparameter;

the pedestrian classification loss function is:

$$\mathcal{L}_{cls}(\{p_i,p_i^*\}) = \frac{1}{N_{cls}}\sum_i L_{log}(p_i,p_i^*)$$

where $N_{cls}$ is the total number of anchor boxes in the RPN module classification process and $L_{log}$ is the log loss;

the aggregation loss function is:

$$\mathcal{L}_{agg} = \mathcal{L}_{reg} + \beta\,\mathcal{L}_{com}$$

where $\mathcal{L}_{reg}$ is the regression loss function, $\mathcal{L}_{com}$ is the compactness loss function, and $\beta$ is a second hyperparameter;

the regression loss function is:

$$\mathcal{L}_{reg} = \frac{1}{N_{reg}}\sum_i p_i^*\,\mathrm{smooth}_{L_1}(t_i,t_i^*)$$

where $N_{reg}$ is the total number of anchor boxes in the regression stage and $\mathrm{smooth}_{L_1}(t_i,t_i^*)$ is the smooth $L_1$ loss value of the predicted detection window $t_i$;

the compactness loss function is:

$$\mathcal{L}_{com} = \frac{1}{N_{com}}\sum_{p}\frac{1}{|\Phi_p|}\sum_{j\in\Phi_p}\mathrm{smooth}_{L_1}(t_j,t_p^*)$$

where $N_{com}$ is the total number of calibrated pedestrians that intersect an anchor box, $\Phi_p$ is the set of anchor boxes associated with the $p$-th calibrated pedestrian window, $|\Phi_p|$ is the total number of anchor boxes associated with that pedestrian, $j$ is the anchor box index, $t_j$ denotes the predicted coordinates of the $j$-th anchor box, and $t_p^*$ denotes the calibrated coordinates of the $p$-th pedestrian window.
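As a minimal illustration of how these terms combine, the following sketch assumes the three per-batch loss values (classification, regression, compactness) have already been computed; the default hyperparameter values are placeholders rather than values prescribed by the patent:

```python
def rpn_loss(l_cls, l_reg, l_com, alpha1=1.0, beta=1.0):
    # L_RPN = L_cls + alpha1 * L_agg, with L_agg = L_reg + beta * L_com;
    # alpha1 and beta are weighting hyperparameters (placeholder defaults).
    l_agg = l_reg + beta * l_com
    return l_cls + alpha1 * l_agg
```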
Further, the Faster R-CNN neural network also comprises a Fast R-CNN module; before "based on a pre-constructed pedestrian detection model, and according to a to-be-detected pedestrian image, acquiring image features corresponding to each preset human body detection area", the method further includes:
based on a preset training image, performing network training on the Fast R-CNN module according to the loss function shown in the following formula:

$$\mathcal{L}_{FRCNN}(\{p_i\},\{t_i\}) = \mathcal{L}_{cls}(\{p_i,p_i^*\}) + \alpha_3\,\mathcal{L}_{agg}(\{p_i^*\},\{t_i\},\{t_i^*\}) + \lambda\,\mathcal{L}_{occ}$$

where $\mathcal{L}_{cls}$ is the pedestrian classification loss function, $\mathcal{L}_{agg}$ is the aggregation loss function, $\mathcal{L}_{occ}$ is the occlusion handling loss function, $i$ denotes the anchor box index, $p_i$ and $t_i$ respectively denote the predicted probability that the $i$-th anchor box is a pedestrian and the corresponding predicted coordinates, $p_i^*$ and $t_i^*$ respectively denote the object class label associated with the $i$-th anchor box and the corresponding calibrated coordinates, $\alpha_3$ is a third hyperparameter, and $\lambda$ is a fourth hyperparameter;

the pedestrian classification loss function is:

$$\mathcal{L}_{cls}(\{p_i,p_i^*\}) = \frac{1}{N_{cls}}\sum_i L_{log}(p_i,p_i^*)$$

where $N_{cls}$ is the total number of anchor boxes in the classification process;

the aggregation loss function is:

$$\mathcal{L}_{agg} = \mathcal{L}_{reg} + \beta\,\mathcal{L}_{com}$$

where $\mathcal{L}_{reg}$ is the regression loss function, $\mathcal{L}_{com}$ is the compactness loss function, and $\beta$ is a second hyperparameter;

the regression loss function is:

$$\mathcal{L}_{reg} = \frac{1}{N_{reg}}\sum_i p_i^*\,\mathrm{smooth}_{L_1}(t_i,t_i^*)$$

where $N_{reg}$ is the total number of anchor boxes in the regression stage and $\mathrm{smooth}_{L_1}(t_i,t_i^*)$ is the smooth $L_1$ loss value of the predicted detection window $t_i$;

the compactness loss function is:

$$\mathcal{L}_{com} = \frac{1}{N_{com}}\sum_{p}\frac{1}{|\Phi_p|}\sum_{j\in\Phi_p}\mathrm{smooth}_{L_1}(t_j,t_p^*)$$

where $N_{com}$ is the total number of calibrated pedestrians that intersect an anchor box, $\Phi_p$ is the set of anchor boxes associated with the $p$-th calibrated pedestrian window, $|\Phi_p|$ is the total number of anchor boxes associated with that pedestrian, $j$ is the anchor box index, $t_j$ denotes the predicted coordinates of the $j$-th anchor box, and $t_p^*$ denotes the calibrated coordinates of the $p$-th pedestrian window.
Further, the step of "matching the anchor boxes with the pedestrian labeling boxes in the training sample" specifically includes:
calculating the intersection-over-union (IoU) ratio between each anchor box and each pedestrian labeling box;
for each pedestrian labeling box, selecting the anchor box with the largest IoU, and matching each selected anchor box with its corresponding pedestrian labeling box;
after removing the selected anchor boxes, judging whether the IoU between each remaining anchor box and each pedestrian labeling box is greater than a preset first threshold, and if so, matching them;
acquiring the pedestrian labeling boxes whose number of matched anchor boxes is less than a preset second number, and selecting all anchor boxes whose IoU with each such pedestrian labeling box is greater than a preset second threshold, the preset first threshold being larger than the preset second threshold;
selecting, in descending order of IoU among all the selected anchor boxes, a preset third number of anchor boxes to match with the corresponding pedestrian labeling box, where the value of the preset third number is the average number of matched anchor boxes over the pedestrian labeling boxes whose number of matches is greater than or equal to the preset second number.
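A sketch of this compensated matching strategy is given below; the IoU matrix is assumed to be precomputed, and the `enough` parameter stands in for the preset second number, whose concrete value the text leaves open:

```python
import numpy as np

def match_anchors(anchors, gt_boxes, iou, t1=0.4, t2=0.1, enough=2):
    """iou: (num_anchors, num_gt) precomputed IoU matrix.
    Returns a dict: gt index -> list of matched anchor indices."""
    # stage 1: each labeling box gets its highest-IoU anchor
    matches = {g: [iou[:, g].argmax()] for g in range(len(gt_boxes))}
    taken = {a for v in matches.values() for a in v}
    # stage 2: remaining anchors with IoU above the first threshold
    for a in range(len(anchors)):
        if a in taken:
            continue
        g = iou[a].argmax()
        if iou[a, g] > t1:
            matches[g].append(a)
    # average match count N_p over sufficiently matched boxes
    well_matched = [len(v) for v in matches.values() if len(v) >= enough]
    n_p = int(np.mean(well_matched)) if well_matched else 1
    # stage 3: compensation for under-matched boxes, by descending IoU > t2
    for g, v in matches.items():
        if len(v) < enough:
            order = np.argsort(-iou[:, g])
            extra = [a for a in order if iou[a, g] > t2][:n_p]
            matches[g] = sorted(set(v) | set(extra))
    return matches
```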
In another aspect of the present invention, a pedestrian detection system based on block occlusion perception is further provided, including:
the image characteristic acquisition module is configured to acquire image characteristics corresponding to each preset human body detection area based on a pre-constructed pedestrian detection model and according to a to-be-detected pedestrian image;
the image feature fusion module is configured to perform feature fusion on the image features acquired by the image feature acquisition module to obtain the overall features of the corresponding pedestrians;
the detection result frame acquisition module is configured to acquire a plurality of detection result frames of the to-be-detected pedestrian image according to the overall characteristics obtained by the image characteristic fusion module;
a detection result frame screening module configured to select a detection result frame satisfying a preset screening condition among the plurality of obtained detection result frames;
wherein the pedestrian detection model is a model constructed based on a Faster R-CNN neural network, and anchor boxes are associated with a high convolution layer of the Faster R-CNN neural network.
Further, the pedestrian detection system further comprises a model training module, the model training module comprising:
the training image processing unit is configured to perform data augmentation processing on the preset training image to obtain a training sample;
the positive and negative sample dividing unit is configured to match the anchor boxes with the pedestrian labeling boxes in the training sample and to divide the anchor boxes into positive samples and negative samples according to the matching result, a positive sample being an anchor box matched with a pedestrian labeling box and a negative sample being an anchor box not matched with any pedestrian labeling box;
the negative sample screening unit is configured to select a preset first number of negative samples by a hard negative mining method;
the network updating unit is configured to calculate a loss function value according to the positive samples and the selected negative samples, to update the Faster R-CNN neural network according to the loss function value, and to perform network training again on the updated Faster R-CNN neural network until the updated Faster R-CNN neural network meets the preset convergence condition.
Further, the Faster R-CNN neural network comprises an RPN module; in this case, the model training module is further configured to perform the following operations:
based on a preset training image, performing network training on the RPN module according to the loss function shown in the following formula:

$$\mathcal{L}_{RPN}(\{p_i\},\{t_i\}) = \mathcal{L}_{cls}(\{p_i,p_i^*\}) + \alpha_1\,\mathcal{L}_{agg}(\{p_i^*\},\{t_i\},\{t_i^*\})$$

where $\mathcal{L}_{cls}$ is the pedestrian classification loss function, $\mathcal{L}_{agg}$ is the aggregation loss function, $i$ denotes the anchor box index, $p_i$ and $t_i$ respectively denote the predicted probability that the $i$-th anchor box is a pedestrian and the corresponding predicted coordinates, $p_i^*$ and $t_i^*$ respectively denote the object class label associated with the $i$-th anchor box and the corresponding calibrated coordinates, and $\alpha_1$ is a first hyperparameter;

the pedestrian classification loss function is:

$$\mathcal{L}_{cls}(\{p_i,p_i^*\}) = \frac{1}{N_{cls}}\sum_i L_{log}(p_i,p_i^*)$$

where $N_{cls}$ is the total number of anchor boxes in the RPN module classification process and $L_{log}$ is the log loss;

the aggregation loss function is:

$$\mathcal{L}_{agg} = \mathcal{L}_{reg} + \beta\,\mathcal{L}_{com}$$

where $\mathcal{L}_{reg}$ is the regression loss function, $\mathcal{L}_{com}$ is the compactness loss function, and $\beta$ is a second hyperparameter;

the regression loss function is:

$$\mathcal{L}_{reg} = \frac{1}{N_{reg}}\sum_i p_i^*\,\mathrm{smooth}_{L_1}(t_i,t_i^*)$$

where $N_{reg}$ is the total number of anchor boxes in the regression stage and $\mathrm{smooth}_{L_1}(t_i,t_i^*)$ is the smooth $L_1$ loss value of the predicted detection window $t_i$;

the compactness loss function is:

$$\mathcal{L}_{com} = \frac{1}{N_{com}}\sum_{p}\frac{1}{|\Phi_p|}\sum_{j\in\Phi_p}\mathrm{smooth}_{L_1}(t_j,t_p^*)$$

where $N_{com}$ is the total number of calibrated pedestrians that intersect an anchor box, $\Phi_p$ is the set of anchor boxes associated with the $p$-th calibrated pedestrian window, $|\Phi_p|$ is the total number of anchor boxes associated with that pedestrian, $j$ is the anchor box index, $t_j$ denotes the predicted coordinates of the $j$-th anchor box, and $t_p^*$ denotes the calibrated coordinates of the $p$-th pedestrian window.
Further, the Faster R-CNN neural network comprises a Fast R-CNN module; in this case, the model training module is further configured to perform the following operations:
based on a preset training image, performing network training on the Fast R-CNN module according to the loss function shown in the following formula:

$$\mathcal{L}_{FRCNN}(\{p_i\},\{t_i\}) = \mathcal{L}_{cls}(\{p_i,p_i^*\}) + \alpha_3\,\mathcal{L}_{agg}(\{p_i^*\},\{t_i\},\{t_i^*\}) + \lambda\,\mathcal{L}_{occ}$$

where $\mathcal{L}_{cls}$ is the pedestrian classification loss function, $\mathcal{L}_{agg}$ is the aggregation loss function, $\mathcal{L}_{occ}$ is the occlusion handling loss function, $i$ denotes the anchor box index, $p_i$ and $t_i$ respectively denote the predicted probability that the $i$-th anchor box is a pedestrian and the corresponding predicted coordinates, $p_i^*$ and $t_i^*$ respectively denote the object class label associated with the $i$-th anchor box and the corresponding calibrated coordinates, $\alpha_3$ is a third hyperparameter, and $\lambda$ is a fourth hyperparameter;

the pedestrian classification loss function is:

$$\mathcal{L}_{cls}(\{p_i,p_i^*\}) = \frac{1}{N_{cls}}\sum_i L_{log}(p_i,p_i^*)$$

where $N_{cls}$ is the total number of anchor boxes in the classification process;

the aggregation loss function is:

$$\mathcal{L}_{agg} = \mathcal{L}_{reg} + \beta\,\mathcal{L}_{com}$$

where $\mathcal{L}_{reg}$ is the regression loss function, $\mathcal{L}_{com}$ is the compactness loss function, and $\beta$ is a second hyperparameter;

the regression loss function is:

$$\mathcal{L}_{reg} = \frac{1}{N_{reg}}\sum_i p_i^*\,\mathrm{smooth}_{L_1}(t_i,t_i^*)$$

where $N_{reg}$ is the total number of anchor boxes in the regression stage and $\mathrm{smooth}_{L_1}(t_i,t_i^*)$ is the smooth $L_1$ loss value of the predicted detection window $t_i$;

the compactness loss function is:

$$\mathcal{L}_{com} = \frac{1}{N_{com}}\sum_{p}\frac{1}{|\Phi_p|}\sum_{j\in\Phi_p}\mathrm{smooth}_{L_1}(t_j,t_p^*)$$

where $N_{com}$ is the total number of calibrated pedestrians that intersect an anchor box, $\Phi_p$ is the set of anchor boxes associated with the $p$-th calibrated pedestrian window, $|\Phi_p|$ is the total number of anchor boxes associated with that pedestrian, $j$ is the anchor box index, $t_j$ denotes the predicted coordinates of the $j$-th anchor box, and $t_p^*$ denotes the calibrated coordinates of the $p$-th pedestrian window.
Further, the positive and negative sample division unit includes:
an intersection-over-union (IoU) calculation subunit configured to calculate the IoU ratio between each anchor box and each pedestrian labeling box;
a first matching subunit configured to select, for each pedestrian labeling box, the anchor box with the largest IoU, and to match each selected anchor box with its corresponding pedestrian labeling box;
a second matching subunit configured to judge, after the selected anchor boxes are removed, whether the IoU between each remaining anchor box and each pedestrian labeling box is greater than a preset first threshold, and if so, to match them;
a third matching subunit configured to acquire the pedestrian labeling boxes whose number of matched anchor boxes is less than a preset second number, and to select all anchor boxes whose IoU with each such pedestrian labeling box is greater than a preset second threshold, the preset first threshold being larger than the preset second threshold;
a fourth matching subunit configured to select, in descending order of IoU among all the selected anchor boxes, a preset third number of anchor boxes to match with the corresponding pedestrian labeling box, the value of the preset third number being the average number of matched anchor boxes over the pedestrian labeling boxes whose number of matches is greater than or equal to the preset second number.
Compared with the closest prior art, the technical scheme at least has the following beneficial effects:
1. According to the pedestrian detection method based on block occlusion perception provided by the invention, the image features of a pedestrian are obtained block by block according to the preset human body detection regions, using the pedestrian detection model built on the Faster R-CNN neural network, and the obtained image features are then fused, so that occluded pedestrians in the image to be detected can be effectively detected.
2. The high convolution layer in the pedestrian detection model provided by the invention is associated with the anchor boxes; since the high convolution layer can extract deeper semantic information, the pedestrian detection precision is improved.
3. The pedestrian detection system based on block occlusion perception of the invention can implement the above pedestrian detection method based on block occlusion perception.
Drawings
FIG. 1 is a schematic diagram of the main steps of the pedestrian detection method based on block occlusion perception in an embodiment of the present invention;
FIG. 2 is a schematic diagram of the main structure of the block occlusion-aware RoI pooling unit in an embodiment of the present invention;
FIG. 3 is a schematic diagram of the main structure of the block occlusion-aware occlusion processing unit in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of the pedestrian detection system based on block occlusion perception in an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
Pedestrians are easily occluded in large-scale crowded environments, which makes effective pedestrian detection difficult. Based on this, the pedestrian detection method based on block occlusion perception of the invention can detect pedestrians efficiently and accurately in complex environments, and still obtains satisfactory detection results in the presence of large-scale occlusion.
The following describes a pedestrian detection method based on block occlusion perception according to the present invention with reference to the accompanying drawings.
Fig. 1 exemplarily shows an implementation flow of a pedestrian detection method based on block occlusion perception in this embodiment, and as shown in fig. 1, the pedestrian detection method based on block occlusion perception in this embodiment may include the following steps:
step S101: and acquiring image characteristics corresponding to each preset human body detection area based on a pre-constructed pedestrian detection model and according to the image of the pedestrian to be detected.
Step S102: and carrying out feature fusion on the acquired image features to obtain the overall features of the corresponding pedestrians.
Step S103: and acquiring a plurality of detection result frames of the pedestrian image to be detected according to the overall characteristics.
Step S104: selecting a detection result frame which meets a preset screening condition from the obtained multiple detection result frames;
specifically, the pedestrian detection model in the present embodiment is a model constructed based on the Faster R-CNN neural network, and an anchor point frame is associated in the high convolution layer of the Faster R-CNN neural network. The dimensions and associated layers of the anchor block of the pedestrian detection model, as well as the basic network framework of the design, are described in detail below.
In the design of the size of the anchor point frame and the related layer, the abundance degree of semantic information and spatial information of feature maps extracted by different convolution layers is different, and considering that under the condition of large-scale shielding, the feature information of a target pedestrian is difficult to extract due to the shielding, and more semantic information is required to support. In practical application, the pedestrian target with the extremely small size as the human face detection does not exist, and the requirement on the spatial information is greatly reduced. The semantic information contained in the features of the bottom layer of the shallow neural network is shallow, and the identification capability of the object with a larger scale is insufficient due to the small receptive field; moreover, because the extracted shallow features lack enough semantic information, under the interference of shielding and the like, the performance of the resolution device is greatly reduced and the robustness is insufficient due to the fact that the extraction of the features is more difficult; and the deep neural network layer can extract deeper semantic information and global information, and although part of spatial information is lost, the characteristics of the deep convolutional layer can effectively overcome the problem of insufficient feature extraction caused by occlusion in a complex environment, particularly under the condition of occlusion.
Therefore, in this embodiment, the top convolutional layer (i.e., the high convolutional layer) is selected to be associated with the anchor frame. For example, if the VGG-16 model is selected as the basic architecture and the selected high-level convolution layer is conv5_3, then a pedestrian image to be detected with a size of 1000 × 600 is obtained, and the corresponding feature map size is 60 × 40. In order to realize the detection of pedestrians with different sizes in the image, for each position of the feature map, anchor point frames with 11 different sizes are densely paved: the areas are respectively (32)2,432,582,782,1062,1442,1942,2612,3532,4772,6432) The width-to-height ratio of all anchor points is 0.41 (human body approximate scale), so as to realize pedestrian detection of different sizes in the image.
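The following sketch tiles these 11 anchor scales over the conv5_3 feature map; the stride of 16 is the usual VGG-16 value for conv5_3 and is assumed here rather than stated in the patent:

```python
import numpy as np

AREAS = np.array([32, 43, 58, 78, 106, 144, 194, 261, 353, 477, 643]) ** 2
RATIO = 0.41  # width / height, the approximate scale of a human body

def make_anchors(feat_w=60, feat_h=40, stride=16):
    ws = np.sqrt(AREAS * RATIO)   # width  = sqrt(area * ratio)
    hs = ws / RATIO               # height = width / ratio
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for w, h in zip(ws, hs):
                anchors.append([cx - w / 2, cy - h / 2,
                                cx + w / 2, cy + h / 2])
    return np.array(anchors)      # (feat_h * feat_w * 11, 4) corner boxes
```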
For the problems of false detection and missed detection caused by occlusion between pedestrians, in the network framework of the pedestrian detection model of this embodiment, the RoI pooling layer in the original Fast R-CNN module is replaced by a block occlusion-aware RoI pooling unit, which integrates the structural information from different positions of the human body, feeds the integrated information to the Fast R-CNN module, and estimates the occlusion state through a small neural network.
Referring to fig. 2, fig. 2 illustrates the main structure of the block occlusion-aware RoI pooling unit in this embodiment. As shown in fig. 2, the body region is first divided into five parts, and for each part the RoI pooling layer samples the features into a small feature map of fixed size (7 in both width and height). Then, based on the obtained feature maps of the different human body regions, the visibility of each part is estimated using an occlusion processing unit. Referring to fig. 3, fig. 3 is a schematic diagram of the main structure of the block occlusion-aware occlusion processing unit in this embodiment; as shown in fig. 3, the occlusion processing unit consists of three convolution layers followed by a softmax layer, and its parameters are trained with a log loss function. Specifically, let $c_{i,j}$ denote the $j$-th part of the $i$-th candidate window, $o_{i,j}$ the corresponding predicted visibility score, and $o_{i,j}^*$ the true visibility score of the corresponding calibration. If $c_{i,j}$ is more than half visible, then $o_{i,j}^*=1$; otherwise it is 0. Mathematically, if the ratio of the intersection between $c_{i,j}$ and the corresponding calibrated part region to the area of $c_{i,j}$ is greater than or equal to 0.5, then $o_{i,j}^*=1$; otherwise it is 0. Formula (1) gives the visibility score of each part on which the occlusion processing unit is based:

$$o_{i,j}^* = \begin{cases}1, & \Omega(c_{i,j}\cap c_{i,j}^*)\,/\,\Omega(c_{i,j}) \ge \theta\\ 0, & \text{otherwise}\end{cases}\tag{1}$$

where $\Omega(\cdot)$ is an area calculation function, $\Omega(c_{i,j})$ is the area of $c_{i,j}$, $c_{i,j}^*$ is the calibrated part region corresponding to $c_{i,j}$, and $\theta$ is the set intersection-ratio threshold, here set to 0.5, indicating that more than half of the part is visible. Accordingly, this embodiment defines the loss function of the occlusion processing unit as a log loss over the predicted part visibilities, as in equation (2):

$$\mathcal{L}_{occ} = -\frac{1}{N_{occ}}\sum_i\sum_{j=1}^{5}\left[o_{i,j}^*\log o_{i,j} + (1-o_{i,j}^*)\log(1-o_{i,j})\right]\tag{2}$$

where $i$ indexes the candidate windows associated with calibrated pedestrians, $j$ indexes the five human body parts, and $N_{occ}$ is the number of such candidate windows.
Then, a dot-product operation is performed between the feature map of each human body part and the corresponding predicted visibility to obtain the final features, whose dimension is 512 × 7 × 7. Finally, the feature maps of the five human body parts are added element-wise and used for the classification and window regression of the Fast R-CNN module.
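A sketch of this block occlusion-aware RoI pooling unit in PyTorch follows; the exact geometry of the five body parts is not specified above, so the part layout used here is an assumption, as are the channel widths inside the occlusion processing unit:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class OcclusionAwareRoIPool(nn.Module):
    """Pool five body parts, weight each part's features by its predicted
    visibility, then sum element-wise (sketch, not the patented design)."""

    def __init__(self, channels=512, out_size=7, spatial_scale=1.0 / 16):
        super().__init__()
        self.out_size, self.scale = out_size, spatial_scale
        # occlusion processing unit: three convolution layers + softmax
        self.occ = nn.Sequential(
            nn.Conv2d(channels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 2, out_size),  # 7x7 -> 1x1 logits per window
        )

    def split_parts(self, rois):
        # rois: (N, 5) rows of (batch_idx, x1, y1, x2, y2); the five-part
        # layout below (head, torso halves, leg halves) is an assumption
        b, x1, y1, x2, y2 = rois.unbind(dim=1)
        w, h = x2 - x1, y2 - y1
        parts = [
            (x1, y1, x2, y1 + 0.3 * h),                      # head
            (x1, y1 + 0.2 * h, x1 + 0.5 * w, y1 + 0.6 * h),  # left torso
            (x1 + 0.5 * w, y1 + 0.2 * h, x2, y1 + 0.6 * h),  # right torso
            (x1, y1 + 0.5 * h, x1 + 0.5 * w, y2),            # left leg
            (x1 + 0.5 * w, y1 + 0.5 * h, x2, y2),            # right leg
        ]
        return [torch.stack([b, *p], dim=1) for p in parts]

    def forward(self, feat, rois):
        fused = 0.0
        for part_rois in self.split_parts(rois):
            f = roi_pool(feat, part_rois, self.out_size, self.scale)
            vis = torch.softmax(self.occ(f).flatten(1), dim=1)[:, 1]
            fused = fused + f * vis.view(-1, 1, 1, 1)  # visibility weighting
        return fused  # (N, channels, 7, 7): element-wise sum of five parts
```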
Further, in the pedestrian detection method shown in fig. 1 in this embodiment, the pedestrian detection model may be subjected to network training according to a preset training image, so as to obtain the pedestrian detection model meeting a preset convergence condition.
Specifically, in this embodiment, the network training may be performed on the pedestrian detection model according to the following steps:
step S201: and carrying out data augmentation processing on the preset training image to obtain a training sample.
In this embodiment, the data augmentation of the training images may include a color jitter operation, a random cropping operation, a horizontal flipping operation, and a scale transformation operation:
First, a color jitter operation is performed on the training image: parameters such as brightness, contrast, and saturation of the training image are each randomly adjusted with a probability of 0.6.
Second, a random cropping operation is performed on the color-jittered training image: five square sub-images are randomly cropped, where one sub-image is the largest square sub-image in the training image and the side lengths of the other four sub-images are 0.4 to 1.0 times the short side of the training image; one of the five sub-images is then randomly selected as the final training sample.
Third, a horizontal flipping operation is performed on the selected training sample: the sample is flipped horizontally at random with a probability of 0.6.
Finally, a scale transformation operation is performed on the horizontally flipped training sample: the training sample is rescaled to a 1000 × 600 image.
In this embodiment, performing the color jitter, random cropping, horizontal flipping, and scale transformation operations on the training image in sequence increases the amount of data without changing the image categories, which improves the generalization ability of the model.
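A sketch of this four-step augmentation pipeline (operating on the image only; the corresponding remapping of the pedestrian labeling boxes is omitted, and the jitter strengths are assumed values):

```python
import random
from PIL import Image
from torchvision import transforms
from torchvision.transforms import functional as TF

def augment(img: Image.Image) -> Image.Image:
    # 1. color jitter with probability 0.6 (jitter strengths assumed)
    if random.random() < 0.6:
        img = transforms.ColorJitter(brightness=0.4, contrast=0.4,
                                     saturation=0.4)(img)
    # 2. random square crop: the largest square, or one of four squares
    #    whose side is 0.4-1.0x the short side, chosen uniformly
    w, h = img.size
    short = min(w, h)
    sides = [short] + [int(short * random.uniform(0.4, 1.0)) for _ in range(4)]
    side = random.choice(sides)
    x0, y0 = random.randint(0, w - side), random.randint(0, h - side)
    img = img.crop((x0, y0, x0 + side, y0 + side))
    # 3. horizontal flip with probability 0.6
    if random.random() < 0.6:
        img = TF.hflip(img)
    # 4. rescale to 1000 x 600
    return img.resize((1000, 600))
```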
Step S202: matching the anchor boxes with the pedestrian labeling boxes in the training sample, and dividing the anchor boxes into positive samples and negative samples according to the matching result, where a positive sample is an anchor box matched with a pedestrian labeling box and a negative sample is an anchor box not matched with any pedestrian labeling box.
Specifically, in order to solve the problem that some pedestrians cannot be matched with enough anchor boxes under the existing matching strategy, the invention adopts a compensation strategy for the labeling boxes. The steps of matching the anchor boxes with the pedestrian labeling boxes in the training sample are as follows:
First, the intersection-over-union (IoU) ratio between each anchor box and each pedestrian labeling box is calculated.
Second, for each pedestrian labeling box, the anchor box with the largest IoU is selected, and each selected anchor box is matched with its corresponding pedestrian labeling box.
Third, after removing the selected anchor boxes, it is judged whether the IoU between each remaining anchor box and each pedestrian labeling box is greater than a preset first threshold; if so, they are matched. In this embodiment the first threshold is 0.4. It should be noted that the average number of matched anchor boxes over all pedestrian labeling boxes matched with enough anchor boxes is denoted $N_p$.
Next, the pedestrian labeling boxes whose number of matched anchor boxes is less than the preset second number are acquired, and all anchor boxes whose IoU with each such pedestrian labeling box is greater than a preset second threshold are selected, the preset first threshold being larger than the preset second threshold. In this embodiment, this step is a scale-compensation operation for the under-matched boxes: the second threshold is set to 0.1, and for every pedestrian labeling box not matched with enough anchor boxes, all anchor boxes whose IoU with it is greater than 0.1 are selected. Equation (3) shows the sequence of all anchor boxes with IoU greater than 0.1:

$$[a_1, a_2, a_3, \dots, a_N]\tag{3}$$

where each $a_N$ includes the location and size of the anchor box.
Finally, a preset third number of anchor boxes are selected, in descending order of IoU among all the selected anchor boxes, to match with the corresponding pedestrian labeling box. In this embodiment, the selected anchor boxes are sorted in descending order of their IoU with the pedestrian labeling box, as in equation (4):

$$[A_1, A_2, A_3, \dots, A_N]\tag{4}$$

and the first $N_p$ anchor boxes are taken as the anchor boxes matched with this pedestrian labeling box, where $N_p$ defaults to the average matching number of the pedestrian labeling boxes and is an adjustable parameter.
The value of the preset third number is thus the average number of matched anchor boxes over the pedestrian labeling boxes whose number of matches is greater than or equal to the preset second number.
Step S203: selecting a preset first number of negative samples by a hard negative mining method.
Specifically, for all negative samples, the error value brought by their classification prediction is calculated; the negative samples are sorted in descending order of error value; the batch of negative samples with the largest error values is selected as the negative samples of the training data set, and all remaining negative samples are discarded, so that the quantity ratio of positive to negative samples is 1:3. The positive and negative samples thus have a relatively balanced quantitative relation, which benefits smooth network training.
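A sketch of this hard negative mining step, assuming the per-anchor classification losses and a mask of negative anchors are already available:

```python
import torch

def mine_hard_negatives(cls_loss, neg_mask, num_pos):
    """Keep the negatives with the largest classification error so that
    positives:negatives = 1:3; cls_loss is the per-anchor loss vector."""
    scores = cls_loss.clone()
    scores[~neg_mask] = -1.0                           # exclude non-negatives
    num_neg = min(3 * num_pos, int(neg_mask.sum()))
    return scores.argsort(descending=True)[:num_neg]   # indices of kept negatives
```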
Step S204: calculating a loss function value according to the positive samples and the selected negative samples, and updating the Faster R-CNN neural network according to the loss function value; network training is then performed again on the updated Faster R-CNN neural network until the updated Faster R-CNN neural network meets the preset convergence condition.
Specifically, to reduce the false detections caused by mutual occlusion between adjacent pedestrians, each candidate window is required to be closer to the calibrated location of the pedestrian associated with it in the data set. The traditional Faster R-CNN detection framework consists of two parts: a region proposal network (RPN) module and a Fast R-CNN module. The former generates high-quality candidate windows, while the latter performs object classification and regression on these candidate windows to better locate the objects.
For the false detection problem caused by occlusion from adjacent pedestrians, this embodiment adjusts and redefines the loss function of the region proposal network (RPN) module, as shown in formula (5):

$$\mathcal{L}_{RPN}(\{p_i\},\{t_i\}) = \mathcal{L}_{cls}(\{p_i,p_i^*\}) + \alpha_1\,\mathcal{L}_{agg}(\{p_i^*\},\{t_i\},\{t_i^*\})\tag{5}$$

where $i$ is the anchor box index, $p_i$ and $t_i$ are the predicted probability that the $i$-th anchor box is a pedestrian and the corresponding predicted coordinates, $p_i^*$ and $t_i^*$ are the object class label associated with the $i$-th anchor box and the corresponding calibrated coordinates (a binary problem here: the pedestrian class is 1 and the background class is 0), and $\alpha_1$ is the first hyperparameter, introduced to adjust the weights of the two loss functions; $\mathcal{L}_{cls}$ is the pedestrian classification loss function and $\mathcal{L}_{agg}$ is the aggregation loss function.
The classification loss is estimated using a log loss function, defined as equation (6):

$$\mathcal{L}_{cls}(\{p_i,p_i^*\}) = \frac{1}{N_{cls}}\sum_i L_{log}(p_i,p_i^*)\tag{6}$$

where $N_{cls}$ is the total number of anchor boxes in the classification process.
In order to enable the RPN module to generate correct candidate windows more efficiently, the invention introduces a new loss function into the RPN module, called the aggregation loss function (aggregation loss). This loss not only makes the candidate windows locate the annotated positions of their associated pedestrians more accurately, but also reduces the distance between candidate windows associated with the same pedestrian. The aggregation loss function is defined in equation (7):

$$\mathcal{L}_{agg} = \mathcal{L}_{reg} + \beta\,\mathcal{L}_{com}\tag{7}$$

where $\mathcal{L}_{reg}$ is the regression loss function, which constrains each candidate window to be closer to its target calibration window; $\mathcal{L}_{com}$ is the compactness loss function, which constrains the candidate windows to locate the position of the target calibrated object more compactly; and $\beta$ is a second hyperparameter for adjusting the weights of the two loss functions.
The invention uses a smooth $L_1$ loss to define the regression loss function $\mathcal{L}_{reg}$, which measures the accuracy of the predicted detection windows, as shown in equation (8):

$$\mathcal{L}_{reg} = \frac{1}{N_{reg}}\sum_i p_i^*\,\mathrm{smooth}_{L_1}(t_i,t_i^*)\tag{8}$$

where $N_{reg}$ is the total number of anchor boxes in the regression stage and $\mathrm{smooth}_{L_1}(t_i,t_i^*)$ is the smooth $L_1$ loss value of the predicted detection window $t_i$.
The compactness loss function $\mathcal{L}_{com}$ evaluates the confidence of all candidate windows associated with the same labeled pedestrian. Specifically, let the calibrated pedestrians that have anchor boxes associated with them (i.e., those intersected by at least one anchor box) be indexed $p=1,\dots,N_{com}$, with calibrated coordinates $t_p^*$, and let $\{\Phi_1,\dots,\Phi_p,\dots\}$ be the corresponding sequence of associated anchor box sets, i.e., every anchor box whose index lies in $\Phi_p$ is associated with the $p$-th calibrated pedestrian. The smooth $L_1$ loss is used to measure the error between the predicted position information of each anchor box and the actually calibrated position information, describing the compactness between the predicted detection windows and the actual calibration window. The specific form of the compactness loss function is shown in equation (9):

$$\mathcal{L}_{com} = \frac{1}{N_{com}}\sum_{p}\frac{1}{|\Phi_p|}\sum_{j\in\Phi_p}\mathrm{smooth}_{L_1}(t_j,t_p^*)\tag{9}$$

where $N_{com}$ is the total number of pedestrians intersected by anchor boxes, $|\Phi_p|$ is the total number of anchor boxes associated with the $p$-th calibrated pedestrian, $j$ is the anchor box index, and $t_j$ denotes the predicted coordinates of the $j$-th anchor box.
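A sketch of the compactness loss of equation (9), assuming the anchor-to-pedestrian association sets Φ_p have already been built during matching:

```python
import torch
import torch.nn.functional as F

def compactness_loss(pred, gt, assoc):
    """Equation (9): for each calibrated pedestrian p, average the smooth-L1
    error of all anchors j in Phi_p against t_p*, then average over the
    N_com pedestrians.
    pred:  (A, 4) predicted regression targets t_j
    gt:    (P, 4) calibrated targets t_p*
    assoc: list of LongTensors; assoc[p] holds the anchor indices in Phi_p
    """
    terms = []
    for p, idx in enumerate(assoc):
        if idx.numel() == 0:   # pedestrian intersects no anchor box
            continue
        target = gt[p].expand(idx.numel(), 4)
        terms.append(F.smooth_l1_loss(pred[idx], target, reduction='mean'))
    return torch.stack(terms).mean() if terms else pred.sum() * 0.0
```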
Meanwhile, in order to further improve the accuracy of window regression and strengthen the model's pedestrian detection capability in occluded environments, the invention also introduces the aggregation loss term into the loss function of the Fast R-CNN module, as shown in formula (10):

$$\mathcal{L}_{FRCNN}(\{p_i\},\{t_i\}) = \mathcal{L}_{cls}(\{p_i,p_i^*\}) + \alpha_3\,\mathcal{L}_{agg}(\{p_i^*\},\{t_i\},\{t_i^*\}) + \lambda\,\mathcal{L}_{occ}\tag{10}$$

where $\alpha_3$ is the third hyperparameter and $\lambda$ is the fourth hyperparameter; the classification loss function $\mathcal{L}_{cls}$ and the aggregation loss function $\mathcal{L}_{agg}$ are defined as in the RPN network, and $\mathcal{L}_{occ}$ is the occlusion handling loss function shown in equation (2). By introducing the aggregation loss term into both the RPN module and the Fast R-CNN module of the pedestrian detector, the localization ability of the detection windows is enhanced and the overall detection performance is improved.
Then, the network parameters are updated iteratively by stochastic gradient descent with back-propagated errors until training converges or the set maximum number of training iterations is reached, yielding the final network model parameters.
In the testing stage, the test image is input into the trained network model for pedestrian detection, and detection result frames are output. Since the number of output detection frames is very large, most detection frames are first screened out with a confidence threshold T of 0.05, and the top $N_a = 400$ detection frames are then selected by confidence. Next, a non-maximum suppression method removes duplicate detection frames, and the top $N_b = 200$ detection frames by confidence are taken as the final detection result.
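A sketch of this test-stage filtering with torchvision's NMS; the NMS IoU threshold of 0.5 is an assumed value, since the text does not specify it:

```python
import torch
from torchvision.ops import nms

def filter_detections(boxes, scores, t=0.05, n_a=400, n_b=200, iou_th=0.5):
    keep = scores > t                              # confidence threshold T
    boxes, scores = boxes[keep], scores[keep]
    order = scores.argsort(descending=True)[:n_a]  # top-Na by confidence
    boxes, scores = boxes[order], scores[order]
    keep = nms(boxes, scores, iou_th)[:n_b]        # NMS, then top-Nb final
    return boxes[keep], scores[keep]
```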
Aiming at the pedestrian detection problem in large-scale occlusion environments, the invention improves the accuracy of pedestrian detection by introducing an occlusion-aware R-CNN model. Specifically, the invention designs a new aggregation loss function to reduce the false detections caused by overlap between adjacent pedestrians and to make the candidate windows locate the target pedestrian positions more compactly and accurately. Meanwhile, to address the detection problems caused by occlusion, the invention designs a block occlusion-aware RoI pooling unit to replace the RoI pooling layer used in the traditional Fast R-CNN; this pooling unit reduces the influence of occlusion on pedestrian detection by integrating the predicted visibility values of different parts of the human body. When training the convolutional neural network, the pedestrian labeling boxes need to be matched with the anchor boxes, but under the existing matching strategy, pedestrian labeling boxes of certain scales cannot be matched with enough anchor boxes; this is addressed by the scale-compensation matching strategy described above. Finally, the invention realizes the pedestrian detection method based on block occlusion perception, which can detect pedestrians in images efficiently and accurately, and in particular significantly improves the pedestrian detection capability in large-scale occlusion environments.
The invention further provides a pedestrian detection system based on block occlusion perception, and referring to fig. 4, fig. 4 exemplarily shows a schematic diagram of a pedestrian detection system based on block occlusion perception in the embodiment, and as shown in fig. 4, the system includes:
the image characteristic acquisition module is configured to acquire image characteristics corresponding to each preset human body detection area based on a pre-constructed pedestrian detection model and according to a to-be-detected pedestrian image;
the image feature fusion module is configured to perform feature fusion on the image features acquired by the image feature acquisition module to obtain the overall features of the corresponding pedestrians;
the detection result frame acquisition module is configured to acquire a plurality of detection result frames of the image of the pedestrian to be detected according to the overall characteristics acquired by the image characteristic fusion module;
a detection result frame screening module configured to select a detection result frame satisfying a preset screening condition among the plurality of obtained detection result frames;
wherein the pedestrian detection model is a model constructed based on a Faster R-CNN neural network, and anchor boxes are associated with a high convolution layer of the Faster R-CNN neural network.
In a preferred embodiment of the above pedestrian detection system based on blocking occlusion perception, the pedestrian detection system further includes a model training module, and the model training module includes:
the training image processing unit is configured to perform data augmentation processing on a preset training image to obtain a training sample;
the positive and negative sample dividing unit is configured to match the anchor boxes with the pedestrian labeling boxes in the training sample and to divide the anchor boxes into positive samples and negative samples according to the matching result, a positive sample being an anchor box matched with a pedestrian labeling box and a negative sample being an anchor box not matched with any pedestrian labeling box;
the negative sample screening unit is configured to select a preset first number of negative samples by a hard negative mining method;
the network updating unit is configured to calculate a loss function value according to the positive samples and the selected negative samples, to update the Faster R-CNN neural network according to the loss function value, and to perform network training again on the updated Faster R-CNN neural network until the updated Faster R-CNN neural network meets the preset convergence condition.
In the above preferred embodiment of the pedestrian detection system based on block occlusion perception, the Faster R-CNN neural network comprises an RPN module; in this case, the model training module is further configured to perform the following operations:
based on a preset training image, performing network training on the RPN module according to the loss function shown in formula (11):

$$\mathcal{L}_{RPN}(\{p_i\},\{t_i\}) = \mathcal{L}_{cls}(\{p_i,p_i^*\}) + \alpha_1\,\mathcal{L}_{agg}(\{p_i^*\},\{t_i\},\{t_i^*\})\tag{11}$$

where $\mathcal{L}_{cls}$ is the pedestrian classification loss function, $\mathcal{L}_{agg}$ is the aggregation loss function, $i$ denotes the anchor box index, $p_i$ and $t_i$ respectively denote the predicted probability that the $i$-th anchor box is a pedestrian and the corresponding predicted coordinates, $p_i^*$ and $t_i^*$ respectively denote the object class label associated with the $i$-th anchor box and the corresponding calibrated coordinates, and $\alpha_1$ is a first hyperparameter;

the pedestrian classification loss function is shown in equation (12):

$$\mathcal{L}_{cls}(\{p_i,p_i^*\}) = \frac{1}{N_{cls}}\sum_i L_{log}(p_i,p_i^*)\tag{12}$$

where $N_{cls}$ is the total number of anchor boxes in the RPN module classification process and $L_{log}$ is the log loss;

the aggregation loss function is shown in equation (13):

$$\mathcal{L}_{agg} = \mathcal{L}_{reg} + \beta\,\mathcal{L}_{com}\tag{13}$$

where $\mathcal{L}_{reg}$ is the regression loss function, $\mathcal{L}_{com}$ is the compactness loss function, and $\beta$ is a second hyperparameter;

the regression loss function is shown in equation (14):

$$\mathcal{L}_{reg} = \frac{1}{N_{reg}}\sum_i p_i^*\,\mathrm{smooth}_{L_1}(t_i,t_i^*)\tag{14}$$

where $N_{reg}$ is the total number of anchor boxes in the regression stage and $\mathrm{smooth}_{L_1}(t_i,t_i^*)$ is the smooth $L_1$ loss value of the predicted detection window $t_i$;

the compactness loss function is shown in equation (15):

$$\mathcal{L}_{com} = \frac{1}{N_{com}}\sum_{p}\frac{1}{|\Phi_p|}\sum_{j\in\Phi_p}\mathrm{smooth}_{L_1}(t_j,t_p^*)\tag{15}$$

where $N_{com}$ is the total number of calibrated pedestrians that intersect an anchor box, $\Phi_p$ is the set of anchor boxes associated with the $p$-th calibrated pedestrian window, $|\Phi_p|$ is the total number of anchor boxes associated with that pedestrian, $j$ is the anchor box index, $t_j$ denotes the predicted coordinates of the $j$-th anchor box, and $t_p^*$ denotes the calibrated coordinates of the $p$-th pedestrian window.
In the above preferred embodiment of the pedestrian detection system based on block occlusion perception, the Faster R-CNN neural network comprises a Fast R-CNN module; in this case, the model training module is further configured to perform the following operations:
based on a preset training image, performing network training on the Fast R-CNN module according to the loss function shown in formula (16):

$$\mathcal{L}_{FRCNN}(\{p_i\},\{t_i\}) = \mathcal{L}_{cls}(\{p_i,p_i^*\}) + \alpha_3\,\mathcal{L}_{agg}(\{p_i^*\},\{t_i\},\{t_i^*\}) + \lambda\,\mathcal{L}_{occ}\tag{16}$$

where $\mathcal{L}_{cls}$ is the pedestrian classification loss function, $\mathcal{L}_{agg}$ is the aggregation loss function, $\mathcal{L}_{occ}$ is the occlusion handling loss function, $i$ denotes the anchor box index, $p_i$ and $t_i$ respectively denote the predicted probability that the $i$-th anchor box is a pedestrian and the corresponding predicted coordinates, $p_i^*$ and $t_i^*$ respectively denote the object class label associated with the $i$-th anchor box and the corresponding calibrated coordinates, $\alpha_3$ is a third hyperparameter, and $\lambda$ is a fourth hyperparameter;

the pedestrian classification loss function is shown in equation (17):

$$\mathcal{L}_{cls}(\{p_i,p_i^*\}) = \frac{1}{N_{cls}}\sum_i L_{log}(p_i,p_i^*)\tag{17}$$

where $N_{cls}$ is the total number of anchor boxes in the classification process;

the aggregation loss function is shown in equation (18):

$$\mathcal{L}_{agg} = \mathcal{L}_{reg} + \beta\,\mathcal{L}_{com}\tag{18}$$

where $\mathcal{L}_{reg}$ is the regression loss function, $\mathcal{L}_{com}$ is the compactness loss function, and $\beta$ is a second hyperparameter;

the regression loss function is shown in equation (19):

$$\mathcal{L}_{reg} = \frac{1}{N_{reg}}\sum_i p_i^*\,\mathrm{smooth}_{L_1}(t_i,t_i^*)\tag{19}$$

where $N_{reg}$ is the total number of anchor boxes in the regression stage and $\mathrm{smooth}_{L_1}(t_i,t_i^*)$ is the smooth $L_1$ loss value of the predicted detection window $t_i$;

the compactness loss function is shown in equation (20):

$$\mathcal{L}_{com} = \frac{1}{N_{com}}\sum_{p}\frac{1}{|\Phi_p|}\sum_{j\in\Phi_p}\mathrm{smooth}_{L_1}(t_j,t_p^*)\tag{20}$$

where $N_{com}$ is the total number of calibrated pedestrians that intersect an anchor box, $\Phi_p$ is the set of anchor boxes associated with the $p$-th calibrated pedestrian window, $|\Phi_p|$ is the total number of anchor boxes associated with that pedestrian, $j$ is the anchor box index, $t_j$ denotes the predicted coordinates of the $j$-th anchor box, and $t_p^*$ denotes the calibrated coordinates of the $p$-th pedestrian window.
In a preferred embodiment of the above pedestrian detection system based on block occlusion perception, the positive and negative sample dividing unit includes:
an intersection-over-union (IoU) calculation subunit configured to calculate the IoU ratio between each anchor box and each pedestrian labeling box;
the first matching subunit is configured to select, for each pedestrian labeling box, the anchor box with the largest IoU, and to match each selected anchor box with its corresponding pedestrian labeling box;
the second matching subunit is configured to judge, after the selected anchor boxes are removed, whether the IoU between each remaining anchor box and each pedestrian labeling box is greater than a preset first threshold, and if so, to match them;
the third matching subunit is configured to acquire the pedestrian labeling boxes whose number of matched anchor boxes is less than a preset second number, and to select all anchor boxes whose IoU with each such pedestrian labeling box is greater than a preset second threshold, the preset first threshold being larger than the preset second threshold;
the fourth matching subunit is configured to select, in descending order of IoU among all the selected anchor boxes, a preset third number of anchor boxes to match with the corresponding pedestrian labeling box, the value of the preset third number being the average number of matched anchor boxes over the pedestrian labeling boxes whose number of matches is greater than or equal to the preset second number.
Those of skill in the art will appreciate that the various illustrative systems and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of electronic hardware and software. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (8)

1. A pedestrian detection method based on block occlusion perception is characterized by comprising the following steps:
acquiring image characteristics corresponding to each preset human body detection area based on a pre-constructed pedestrian detection model and according to a to-be-detected pedestrian image;
performing feature fusion on the acquired image features to obtain the overall features of the corresponding pedestrians;
acquiring a plurality of detection result frames of the to-be-detected pedestrian image according to the overall characteristics;
selecting a detection result frame which meets a preset screening condition from the obtained detection result frames (this four-step flow is sketched in code after this claim);
the pedestrian detection model is a model constructed based on a Faster R-CNN neural network, and anchor boxes are associated with the higher convolutional layers of the Faster R-CNN neural network;
the Faster R-CNN neural network comprises an RPN module, and the method further comprises the step of carrying out network training on the RPN module based on a preset training image and according to a loss function shown in the following formula:
L(\{p_i\}, \{t_i\}) = L_{cls}(\{p_i\}) + \alpha_1 \, L_{agg}(\{p_i^*\}, \{t_i\})

wherein L_cls is the pedestrian classification loss function and L_agg is the aggregation loss function; i denotes the anchor box index; p_i and t_i respectively denote the predicted probability that the i-th anchor box is a pedestrian and the corresponding predicted pedestrian coordinates; p_i^* and t_i^* respectively denote the object class label associated with the i-th anchor box and the corresponding calibrated coordinates; α_1 is a first hyperparameter;
the pedestrian classification loss function is:

L_{cls}(\{p_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)

wherein N_cls is the total number of anchor boxes in the classification process of the RPN module;
the aggregation loss function is:

L_{agg}(\{p_i^*\}, \{t_i\}) = L_{reg}(\{p_i^*\}, \{t_i\}) + \beta \, L_{com}(\{p_i^*\}, \{t_i\})

wherein L_reg is the regression loss function, L_com is the compactness loss function, and β is a second hyperparameter;
the regression loss function is:

L_{reg}(\{p_i^*\}, \{t_i\}) = \frac{1}{N_{reg}} \sum_i p_i^* \, L_1(t_i, t_i^*)

wherein N_reg is the total number of anchor boxes in the regression stage and L_1(t_i, t_i^*) is the smooth-L1 loss value with respect to the predicted detection window t_i;
the compactness loss function is:

L_{com}(\{p_i^*\}, \{t_i\}) = \frac{1}{N_{com}} \sum_{i=1}^{N_{com}} L_1\left(t_i^* - \frac{1}{|\Phi_i|} \sum_{j \in \Phi_i} t_j\right)

wherein N_com is the total number of calibrated pedestrians that intersect with anchor boxes, Φ_i is the set of anchor boxes associated with the i-th calibrated pedestrian window, |Φ_i| is the number of anchor boxes in that set, j is an anchor box index within Φ_i, t_j denotes the predicted pedestrian coordinates of the j-th anchor box, and t_i^* denotes the calibrated coordinates of the i-th pedestrian window.
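Read end to end, claim 1 recites a four-stage inference flow: per-region features, feature fusion, candidate result frames, and screening. The sketch below illustrates that flow; the model object and its extract_part_features, fuse_features and predict_boxes methods are hypothetical stand-ins, and the thresholds are illustrative values, not taken from the patent.

import torchvision

def detect_pedestrians(model, image, score_thr=0.5, nms_thr=0.45):
    # 1. Image features for each preset human body detection area.
    part_feats = model.extract_part_features(image)   # e.g. head, torso, legs
    # 2. Feature fusion into the overall features of the pedestrian.
    fused = model.fuse_features(part_feats)
    # 3. A plurality of detection result frames from the fused features.
    boxes, scores = model.predict_boxes(fused)
    # 4. Preset screening condition: confidence threshold followed by
    #    non-maximum suppression.
    keep = scores > score_thr
    boxes, scores = boxes[keep], scores[keep]
    keep = torchvision.ops.nms(boxes, scores, nms_thr)
    return boxes[keep], scores[keep]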
2. The pedestrian detection method based on block occlusion perception according to claim 1, wherein before "acquiring image features corresponding to each preset human body detection area based on a pre-constructed pedestrian detection model and according to a to-be-detected pedestrian image", the method further comprises:

performing data augmentation processing on a preset training image to obtain training samples;

matching the anchor boxes with the pedestrian labeling frames in the training samples, and dividing the anchor boxes into positive samples and negative samples according to the matching result, wherein a positive sample is an anchor box matched with a pedestrian labeling frame and a negative sample is an anchor box not matched with any pedestrian labeling frame;

selecting a preset first number of negative samples by a hard negative mining method;

calculating a loss function value according to the positive samples and the selected negative samples, updating the Faster R-CNN neural network according to the loss function value, and performing network training again on the updated Faster R-CNN neural network until the updated Faster R-CNN neural network meets a preset convergence condition (this training step is sketched in code after this claim).
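A rough sketch of the training step recited in claim 2, assuming a net that returns per-anchor scores, box predictions and their matched targets, and an augment helper for the data augmentation step; the 3:1 negative-to-positive ratio stands in for the preset first number of negative samples.

import torch
import torch.nn.functional as F

def train_step(net, optimizer, images, gt_boxes, neg_ratio=3):
    images, gt_boxes = augment(images, gt_boxes)       # assumed augmentation helper
    cls_logits, box_preds, labels, box_targets = net(images, gt_boxes)

    per_anchor = F.binary_cross_entropy_with_logits(
        cls_logits, labels.float(), reduction='none')
    pos = labels == 1

    # Hard negative mining: keep only the highest-loss negatives,
    # at a preset ratio to the number of positive samples.
    neg_losses = per_anchor[labels == 0]
    n_neg = min(neg_ratio * int(pos.sum()), neg_losses.numel())
    hard_neg, _ = neg_losses.topk(n_neg)

    loss = (per_anchor[pos].sum() + hard_neg.sum()) / max(int(pos.sum()) + n_neg, 1)
    if pos.any():
        loss = loss + F.smooth_l1_loss(box_preds[pos], box_targets[pos])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)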
3. The pedestrian detection method based on block occlusion perception according to claim 2, wherein the Faster R-CNN neural network further comprises a Fast R-CNN module; before "based on a pre-constructed pedestrian detection model, and according to a to-be-detected pedestrian image, acquiring image features corresponding to each preset human body detection area", the method further includes:
based on a preset training image, performing network training on the Fast R-CNN module according to a loss function shown in the following formula:
L(\{p_i\}, \{t_i\}) = L_{cls}(\{p_i\}) + \alpha_3 \, L_{agg}(\{p_i^*\}, \{t_i\}) + \lambda \, L_{occ}

wherein L_occ is the occlusion handling loss function, α_3 is a third hyperparameter, and λ is a fourth hyperparameter.
4. The pedestrian detection method based on block occlusion perception according to claim 2 or 3, wherein the step of matching the anchor point frame with the pedestrian labeling frame in the training sample specifically comprises:
calculating the intersection-over-union (IoU) ratio between each anchor box and each pedestrian labeling frame;

selecting, for each pedestrian labeling frame, the anchor box having the largest IoU with it, and matching each selected anchor box with its corresponding pedestrian labeling frame;

judging, after the selected anchor boxes are removed, whether the IoU between each remaining anchor box and each pedestrian labeling frame is greater than a preset first threshold, and if so, matching them;

acquiring the pedestrian labeling frames whose number of matched anchor boxes is less than a preset second number, and selecting all anchor boxes whose IoU with each such pedestrian labeling frame is greater than a preset second threshold, the preset first threshold being greater than the preset second threshold;

selecting, in descending order of IoU, a preset third number of the selected anchor boxes and matching them with the corresponding pedestrian labeling frames, the preset third number taking the value of the average number of matched anchor boxes over the pedestrian labeling frames whose number of matched anchor boxes is greater than or equal to the preset second number.
5. A pedestrian detection system based on blocking shielding perception is characterized by comprising:
the image characteristic acquisition module is configured to acquire image characteristics corresponding to each preset human body detection area based on a pre-constructed pedestrian detection model and according to a to-be-detected pedestrian image;
the image feature fusion module is configured to perform feature fusion on the image features acquired by the image feature acquisition module to obtain the overall features of the corresponding pedestrians;
the detection result frame acquisition module is configured to acquire a plurality of detection result frames of the to-be-detected pedestrian image according to the overall characteristics obtained by the image characteristic fusion module;
a detection result frame screening module configured to select a detection result frame satisfying a preset screening condition among the plurality of obtained detection result frames;
the pedestrian detection model is a model constructed based on a Faster R-CNN neural network, and anchor boxes are associated with the higher convolutional layers of the Faster R-CNN neural network;
the system also comprises a model training module, wherein the Faster R-CNN neural network comprises an RPN module; in this case, the model training module is configured to perform the following operations:
based on a preset training image and according to a loss function shown in the following formula, performing network training on the RPN module:
L(\{p_i\}, \{t_i\}) = L_{cls}(\{p_i\}) + \alpha_1 \, L_{agg}(\{p_i^*\}, \{t_i\})

wherein L_cls is the pedestrian classification loss function and L_agg is the aggregation loss function; i denotes the anchor box index; p_i and t_i respectively denote the predicted probability that the i-th anchor box is a pedestrian and the corresponding predicted pedestrian coordinates; p_i^* and t_i^* respectively denote the object class label associated with the i-th anchor box and the corresponding calibrated coordinates; α_1 is a first hyperparameter;
the pedestrian classification loss function is:

L_{cls}(\{p_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)

wherein N_cls is the total number of anchor boxes in the classification process of the RPN module;
the aggregation loss function is:

L_{agg}(\{p_i^*\}, \{t_i\}) = L_{reg}(\{p_i^*\}, \{t_i\}) + \beta \, L_{com}(\{p_i^*\}, \{t_i\})

wherein L_reg is the regression loss function, L_com is the compactness loss function, and β is a second hyperparameter;
the regression loss function is:

L_{reg}(\{p_i^*\}, \{t_i\}) = \frac{1}{N_{reg}} \sum_i p_i^* \, L_1(t_i, t_i^*)

wherein N_reg is the total number of anchor boxes in the regression stage and L_1(t_i, t_i^*) is the smooth-L1 loss value with respect to the predicted detection window t_i;
the compactness loss function is:

L_{com}(\{p_i^*\}, \{t_i\}) = \frac{1}{N_{com}} \sum_{i=1}^{N_{com}} L_1\left(t_i^* - \frac{1}{|\Phi_i|} \sum_{j \in \Phi_i} t_j\right)

wherein N_com is the total number of calibrated pedestrians that intersect with anchor boxes, Φ_i is the set of anchor boxes associated with the i-th calibrated pedestrian window, |Φ_i| is the number of anchor boxes in that set, j is an anchor box index within Φ_i, t_j denotes the predicted pedestrian coordinates of the j-th anchor box, and t_i^* denotes the calibrated coordinates of the i-th pedestrian window.
6. The pedestrian detection system based on block occlusion perception according to claim 5, wherein the model training module comprises:
a training image processing unit configured to perform data augmentation processing on a preset training image to obtain training samples;

a positive and negative sample dividing unit configured to match the anchor boxes with the pedestrian labeling frames in the training samples and to divide the anchor boxes into positive samples and negative samples according to the matching result, wherein a positive sample is an anchor box matched with a pedestrian labeling frame and a negative sample is an anchor box not matched with any pedestrian labeling frame;

a negative sample screening unit configured to select a preset first number of negative samples by a hard negative mining method;

a network updating unit configured to calculate a loss function value according to the positive samples and the selected negative samples, to update the Faster R-CNN neural network according to the loss function value, and to perform network training again on the updated Faster R-CNN neural network until the updated Faster R-CNN neural network meets a preset convergence condition.
7. The block occlusion perception-based pedestrian detection system of claim 6, wherein the Faster R-CNN neural network comprises a Fast R-CNN module; in this case, the model training module is further configured to perform the following operations:
based on a preset training image and according to a loss function shown as the following formula, performing network training on the Fast R-CNN module:
L(\{p_i\}, \{t_i\}) = L_{cls}(\{p_i\}) + \alpha_3 \, L_{agg}(\{p_i^*\}, \{t_i\}) + \lambda \, L_{occ}

wherein L_occ is the occlusion handling loss function, α_3 is the third hyperparameter, and λ is the fourth hyperparameter.
8. The pedestrian detection system based on block occlusion perception according to claim 6 or 7, wherein the positive and negative sample division unit comprises:
an IoU calculation subunit configured to calculate the intersection-over-union (IoU) ratio between each anchor box and each pedestrian labeling frame;

a first matching subunit configured to select, for each pedestrian labeling frame, the anchor box having the largest IoU with it, and to match each selected anchor box with its corresponding pedestrian labeling frame;

a second matching subunit configured to judge, after the selected anchor boxes are removed, whether the IoU between each remaining anchor box and each pedestrian labeling frame is greater than a preset first threshold, and if so, to match them;

a third matching subunit configured to acquire the pedestrian labeling frames whose number of matched anchor boxes is less than a preset second number, and to select all anchor boxes whose IoU with each such pedestrian labeling frame is greater than a preset second threshold, the preset first threshold being greater than the preset second threshold;

a fourth matching subunit configured to select, in descending order of IoU, a preset third number of the selected anchor boxes and match them with the corresponding pedestrian labeling frames, the preset third number taking the value of the average number of matched anchor boxes over the pedestrian labeling frames whose number of matched anchor boxes is greater than or equal to the preset second number.
CN201810393658.1A 2018-04-27 2018-04-27 Pedestrian detection method and system based on blocking and shielding perception Active CN108898047B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810393658.1A CN108898047B (en) 2018-04-27 2018-04-27 Pedestrian detection method and system based on blocking and shielding perception

Publications (2)

Publication Number Publication Date
CN108898047A CN108898047A (en) 2018-11-27
CN108898047B true CN108898047B (en) 2021-03-19

Family

ID=64342527

Country Status (1)

Country Link
CN (1) CN108898047B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109583512B (en) * 2018-12-14 2021-05-25 北京旷视科技有限公司 Image processing method, device and system
CN109766796B (en) * 2018-12-20 2023-04-18 西华大学 Deep pedestrian detection method for dense crowd
CN110222764B (en) * 2019-06-10 2021-06-18 中南民族大学 Method, system, device and storage medium for detecting occluded target
CN110222657B (en) * 2019-06-11 2021-07-20 中国科学院自动化研究所 Single-step face detector optimization system, method and device
CN112307826A (en) * 2019-07-30 2021-02-02 华为技术有限公司 Pedestrian detection method, device, computer-readable storage medium and chip
CN110532985B (en) * 2019-09-02 2022-07-22 北京迈格威科技有限公司 Target detection method, device and system
CN110796071B (en) * 2019-10-28 2021-02-19 广州云从博衍智能科技有限公司 Behavior detection method, system, machine-readable medium and device
CN110796069B (en) * 2019-10-28 2021-02-05 广州云从博衍智能科技有限公司 Behavior detection method, system, equipment and machine readable medium
CN111144203B (en) * 2019-11-19 2023-06-16 浙江工商大学 Pedestrian shielding detection method based on deep learning
CN110880177A (en) * 2019-11-26 2020-03-13 北京推想科技有限公司 Image identification method and device
CN110796127A (en) * 2020-01-06 2020-02-14 四川通信科研规划设计有限责任公司 Embryo prokaryotic detection system based on occlusion sensing, storage medium and terminal
CN111832515A (en) * 2020-07-21 2020-10-27 上海有个机器人有限公司 Dense pedestrian detection method, medium, terminal and device
CN112465799A (en) * 2020-12-09 2021-03-09 南京甄视智能科技有限公司 Optimization of object detector and object detection
CN112528995B (en) * 2020-12-22 2023-08-04 北京百度网讯科技有限公司 Method for training target detection model, target detection method and device
CN112906732B (en) * 2020-12-31 2023-12-15 杭州旷云金智科技有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN113569726B (en) * 2021-07-27 2023-04-14 湖南大学 Pedestrian detection method combining automatic data amplification and loss function search
CN114550221B (en) * 2022-04-22 2022-07-22 苏州浪潮智能科技有限公司 Pedestrian re-identification method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9858496B2 (en) * 2016-01-20 2018-01-02 Microsoft Technology Licensing, Llc Object detection and classification in images

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354547A (en) * 2015-10-30 2016-02-24 河海大学 Pedestrian detection method in combination of texture and color features
CN106022237A (en) * 2016-05-13 2016-10-12 电子科技大学 Pedestrian detection method based on end-to-end convolutional neural network
CN106599939A (en) * 2016-12-30 2017-04-26 深圳市唯特视科技有限公司 Real-time target detection method based on region convolutional neural network
CN107730881A (en) * 2017-06-13 2018-02-23 银江股份有限公司 Traffic congestion vision detection system based on depth convolutional neural networks
CN107358182A (en) * 2017-06-29 2017-11-17 维拓智能科技(深圳)有限公司 Pedestrian detection method and terminal device
CN107403141A (en) * 2017-07-05 2017-11-28 中国科学院自动化研究所 Method for detecting human face and device, computer-readable recording medium, equipment
CN107463892A (en) * 2017-07-27 2017-12-12 北京大学深圳研究生院 Pedestrian detection method in a kind of image of combination contextual information and multi-stage characteristics
CN107679250A (en) * 2017-11-01 2018-02-09 浙江工业大学 A kind of multitask layered image search method based on depth own coding convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks; Shaoqing Ren et al.; arXiv:1506.01497v3; 2016-01-07; pp. 1-14 *
Pedestrian re-identification algorithm based on a novel triplet convolutional neural network; Zhu Jianqing et al.; Journal of Electronics & Information Technology; 2018-04-19; Vol. 40, No. 4; pp. 1012-1016 *

Also Published As

Publication number Publication date
CN108898047A (en) 2018-11-27

Similar Documents

Publication Publication Date Title
CN108898047B (en) Pedestrian detection method and system based on blocking and shielding perception
CN110084292B (en) Target detection method based on DenseNet and multi-scale feature fusion
CN106960195B (en) Crowd counting method and device based on deep learning
CN113160192B (en) Visual sense-based snow pressing vehicle appearance defect detection method and device under complex background
CN110738101B (en) Behavior recognition method, behavior recognition device and computer-readable storage medium
CN108121984B (en) Character recognition method and device
CN103699905B (en) Method and device for positioning license plate
CN108830188A (en) Vehicle checking method based on deep learning
CN110781836A (en) Human body recognition method and device, computer equipment and storage medium
CN109977997B (en) Image target detection and segmentation method based on convolutional neural network rapid robustness
CN104615986B (en) The method that pedestrian detection is carried out to the video image of scene changes using multi-detector
CN111046856B (en) Parallel pose tracking and map creating method based on dynamic and static feature extraction
CN107507170A (en) A kind of airfield runway crack detection method based on multi-scale image information fusion
CN106780727B (en) Vehicle head detection model reconstruction method and device
CN108537286A (en) A kind of accurate recognition methods of complex target based on key area detection
CN111126393A (en) Vehicle appearance refitting judgment method and device, computer equipment and storage medium
CN104463240B (en) A kind of instrument localization method and device
CN108229524A (en) A kind of chimney and condensing tower detection method based on remote sensing images
CN108009556A (en) A kind of floater in river detection method based on fixed point graphical analysis
CN114821102A (en) Intensive citrus quantity detection method, equipment, storage medium and device
CN108133235A (en) A kind of pedestrian detection method based on neural network Analysis On Multi-scale Features figure
CN105740751A (en) Object detection and identification method and system
CN111008576A (en) Pedestrian detection and model training and updating method, device and readable storage medium thereof
CN106845458A (en) A kind of rapid transit label detection method of the learning machine that transfinited based on core
CN112541372B (en) Difficult sample screening method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant