CN111461036A - Real-time pedestrian detection method using background modeling enhanced data - Google Patents


Info

Publication number
CN111461036A
CN111461036A
Authority
CN
China
Prior art keywords
background
picture
pedestrians
convolution
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010263248.2A
Other languages
Chinese (zh)
Other versions
CN111461036B (en)
Inventor
梁超
张圆通
王晓
吴佳乐
伍谦
白云鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202010263248.2A priority Critical patent/CN111461036B/en
Publication of CN111461036A publication Critical patent/CN111461036A/en
Application granted granted Critical
Publication of CN111461036B publication Critical patent/CN111461036B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Abstract

The invention discloses a real-time pedestrian detection method using background-modeling-enhanced data. First, background modeling is performed using a background picture from the monitoring data and a picture containing pedestrians, and the picture containing pedestrians is binarized to generate a mask map. The mask map and the picture containing pedestrians are then input to a deep saliency detection network to generate a saliency map, which is converted into a pseudo color map. Finally, a detection network is established, comprising two completely symmetrical subnets A and B and a final target detection layer; during training, pictures containing pedestrians and the corresponding pseudo color maps are input to subnets A and B respectively, the result maps obtained by subnets A and B are fed to the target detection layer, and training stops when the loss value tends to a constant. The trained detection network is then used to detect the pedestrian pictures to be detected.

Description

Real-time pedestrian detection method using background modeling enhanced data
Technical Field
The invention belongs to the field of computer vision, in particular to the field of pedestrian detection; it mainly uses background modeling and saliency detection techniques to enhance data and finally trains a real-time pedestrian detection model.
Background
Pedestrian detection is a very important area of computer vision and the underlying technology of many intelligent devices. It has great application value in security, intelligent transportation, autonomous driving, trajectory tracking and other fields.
In actual detection, most mainstream pedestrian detection algorithms are optimized for the specific detection object, namely pedestrians, on the basis of a general object detector. Traditional detection methods mainly use machine learning algorithms such as ICF, LDCF and the like, but these machine-learning-based detection algorithms have clear drawbacks: the detection speed is slow, the miss rate is high, and the robustness of the model is poor, so they are difficult to use in an actual production environment. In the past decade, detection algorithms based on deep networks have steadily improved detection accuracy and speed, and detection models tailored to a specific object, especially algorithms designed for detecting pedestrians, have further improved the effectiveness of pedestrian detection.
Although some methods can considerably improve the detection capability of a model by using multispectral data, training such a model requires a large number of color pictures with corresponding saliency maps or infrared images. Multispectral data are often difficult to obtain owing to equipment limitations, and the large number of pixel-accurate mask maps needed to obtain saliency maps must be labeled manually at great cost. Beyond these training-phase problems, such models also cannot achieve real-time detection in the detection phase.
Disclosure of Invention
The invention mainly addresses the problems of existing hybrid-network pedestrian detectors: the multispectral training sets required by existing image enhancement or hybrid network techniques are difficult to obtain; part of the data requires a large number of manually labeled mask maps; the trained models are often large and difficult to deploy; and real-time detection cannot be guaranteed. In view of these problems, the invention proposes a model that first obtains a saliency map using background modeling and a saliency detection network, and then inputs the processed saliency map together with the original detection picture into a hybrid network, thereby reducing the miss rate.
The technical scheme of the invention is a real-time pedestrian detection method using background modeling to enhance data, which comprises the following steps:
step 1, performing background modeling by using a background picture of monitoring data and a picture containing pedestrians, and performing binarization processing on the picture containing the pedestrians to generate a mask map;
step 2, inputting the mask map and the picture containing pedestrians, and generating a saliency map using a deep saliency detection network;
step 3, converting the saliency map into a pseudo color map;
step 4, establishing a detection network comprising two completely symmetrical subnets A and B and a final target detection layer; training the detection network by inputting the pictures containing pedestrians and the corresponding pseudo color maps into subnets A and B respectively, feeding the result maps obtained by subnets A and B into the target detection layer, and stopping training when the loss value tends to a constant; the subnets A and B comprise parallel RFB modules and BasicRFB_a modules, and the specific processing is as follows,
firstly, feature maps of the original color picture and the corresponding pseudo color picture are extracted by a pre-trained VGG16 model: the feature map conv4_2, obtained by two convolutions in layer 4 of VGG16, is sent to the BasicRFB_a module and its output is sent to the target detection layer; the feature map conv7_2, obtained by two convolutions in layer 7 of VGG16, is sent to the RFB module and its output is also sent to the target detection layer;
for the RFB module, a 1x1 convolution is first used to reduce the number of channels of the feature map, followed by three branches:
(1) a 3x3 dilated (hole) convolution with dilation rate 1;
(2) a 1x1 convolution, then a 3x3 dilated convolution with dilation rate 3;
(3) a 5x5 convolution, then a 3x3 dilated convolution with dilation rate 5;
the three branches are concatenated, the result is concatenated with the batch-normalized feature map conv7_2 extracted by VGG16, and the result is sent to a ReLU activation layer;
for the BasicRFB_a module, a 1x1 convolution is first used to reduce the number of channels of the feature map, followed by four branches:
(1a) a 3x3 dilated convolution with dilation rate 1;
(2a) a 1x3 convolution, then a 3x3 dilated convolution with dilation rate 3;
(3a) a 3x1 convolution, then a 3x3 dilated convolution with dilation rate 3;
(4a) a 3x3 convolution, then a 3x3 dilated convolution with dilation rate 5;
the four branches are concatenated, the result is concatenated with the batch-normalized feature map conv4_2 extracted by VGG16, and the result is sent to a ReLU activation layer;
and 5, detecting the pedestrian picture to be detected by using the trained detection network.
Further, the specific implementation manner of step 1 is as follows,
step 1.1, preparing data: using a picture containing pedestrians and a surveillance background picture taken at the same position under the same illumination conditions but containing no pedestrians, and marking the parts containing pedestrians with rectangular frames;
step 1.2, comparing the background picture without pedestrians and the target picture with pedestrians pixel by pixel in a rectangular frame area with pedestrians, calculating Euclidean distance, and preliminarily judging whether each pixel point belongs to the pedestrians or the background according to the Euclidean distance;
step 1.3, a two-dimensional Gaussian distribution density function is introduced to estimate the probability that the position of a certain pixel point is a pedestrian, and a loss function is constructed by combining the Euclidean distance in the step 1.2;
step 1.4, recording the proportion of the pedestrian pixel points in the rectangular frame as mu%, sequentially calculating the loss function value of each pixel point, sequencing in a descending order, taking the first mu% of the pixel points to judge as the foreground, and judging other pixel points as the background;
step 1.5, correcting pixel points: if all points surrounding a certain point are judged as foreground, the point does not belong to an edge area, and yet the point itself is judged as background, the point is corrected to foreground, finally yielding the foreground area and the background area;
and step 1.6, performing binarization processing: all pixel points determined as the foreground area are set to pure white and all pixel points of the background area to pure black, generating the mask map.
Further, the specific implementation manner of step 1.2 is as follows,
Comparing the RGB channel values of the pixel points at corresponding positions of the two pictures: let the R, G and B channel values of a pixel point P in the background picture and in the picture containing pedestrians be R1, G1, B1 and R2, G2, B2 respectively; the pixel points are divided into two categories, those belonging to the pedestrian part P_P and those belonging to the background part P_b.
If
sqrt((R1 − R2)^2 + (G1 − G2)^2 + (B1 − B2)^2) > t
then P ∈ P_P; otherwise P ∈ P_b,
where t is the discrimination threshold.
Further, the specific implementation manner of introducing the two-dimensional Gaussian distribution density function to estimate the probability that the position of a certain pixel point is a pedestrian in the step 1.3 is as follows,
f(i, j) = 1 / (2π·σ_w·σ_h) · exp( −( (i − x)^2 / (2σ_w^2) + (j − y)^2 / (2σ_h^2) ) )
where i and j are the coordinates of the pixel point, x and y are the coordinates of the center point of the rectangular frame, and σ_w and σ_h are the standard deviations determined by the width and height of the rectangular frame.
Further, the overall loss function of step 1.3 is expressed as,
loss(i, j) = D(i, j) / D_max + α · f(i, j) · W_(i,j)
where the first term is the Euclidean distance between a pixel point and the corresponding background-picture pixel, normalized by D_max, the maximum such distance over all pixel points, and α is an empirical coefficient that determines the weight, i.e., the relative contributions of the pixel point's distance from the center and its difference from the background picture to the loss function.
Further, in step 1.5, when multiple foreground regions exist under the same rectangular frame, in order to avoid splitting a pedestrian it is judged, before the preliminary binarization processing, whether the multiple foreground regions marked under the same rectangular frame need to be merged; the specific implementation is as follows,
traversing the foreground regions in a rectangular frame pairwise to judge whether to merge them: for t_i, t_j ∈ T, T represents the set of foreground regions in the rectangular frame, rate is the average width-to-height ratio of pedestrian rectangular frames, x and y are the center coordinates of a region's circumscribed rectangle, t_k.width and t_k.height are the width and height of the circumscribed rectangle of the k-th foreground region, t_k.x and t_k.y are the center coordinates of the circumscribed rectangle of the k-th foreground region, W and H are the width and height of the rectangular frame, and D_r and D_r' are judgment thresholds; if t_i and t_j both satisfy
Figure BDA0002440177060000051
And is
Figure BDA0002440177060000052
Figure BDA0002440177060000053
Then two areas are merged, and the specific merging mode is as follows: and connecting a left lower point of the circumscribed rectangle of the upper partial region with a left upper point of the circumscribed rectangle of the lower partial region, connecting a right lower point of the circumscribed rectangle of the upper partial region with a right upper point of the circumscribed rectangle of the lower partial region, and completely filling the surrounded region into a foreground region.
Further, the saliency detection network in step 2 adopts PiCANet.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) With comparable precision, the parameter count of the whole network is small; it is a lightweight detection network that is easy to port and deploy.
(2) Because the pseudo heat map input provides richer semantic information, the detection model maintains a low miss rate even under dim or poor lighting conditions.
(3) Compared with traditional background modeling, the method does not depend on a specific background environment and has stronger generality.
Drawings
FIG. 1 is a flow chart of the detection according to an embodiment of the present invention;
FIG. 2 is an illustration of a data set used in an embodiment of the present invention, wherein the left side is a picture without a pedestrian and the right side is a picture with a pedestrian;
FIG. 3 is a network architecture diagram of the saliency detection network;
fig. 4 is a comparison of background modeling and saliency detection network results, where 1 is the original picture, 2 is the picture preliminarily generated by background modeling, 3 is the saliency map generated by PiCANet, and 4 is the pixel-accurate ground truth;
FIG. 5 is a pseudo color map generated by an embodiment of the present invention;
fig. 6 is a network structure diagram of subnets a and B according to an embodiment of the present invention.
Detailed Description
The following describes the technical solution of the present invention in detail with reference to the flowchart (FIG. 1).
Step 1, performing background modeling by using a background picture and a picture containing pedestrians;
step 1.1, data set preparation. Generally speaking, in a training set for pedestrian recognition and object detection, pedestrians in a training picture are marked by using rectangular frames, and relative position information is input into a training network, and a picture containing pedestrians and a background picture without pedestrians at the same position and under the same illumination condition are shot by monitoring, wherein the size of the picture containing pedestrians is the same as that of the background picture, as shown in fig. 2.
Step 1.2, comparing the background picture without pedestrians and the target picture with pedestrians pixel by pixel in a rectangular frame area with pedestrians, calculating Euclidean distance, and preliminarily judging whether each pixel point belongs to the pedestrians or the background according to the Euclidean distance;
specifically, RGB channel values of pixel points at corresponding positions of every two pictures are compared, R, G and B channel values of a certain pixel point P of a background picture and a certain pixel point P of a picture containing pedestrians are respectively set as R1, G1, B1, R2, G2 and B2, the pixel points are divided into two types, and the pixel points belong to the part P of the pedestriansPAnd a part P belonging to the backgroundbAnd the distinguishing threshold is t (the threshold t has no fixed value and needs to be determined according to different environments of the image in actual operation)
If
sqrt((R1 − R2)^2 + (G1 − G2)^2 + (B1 − B2)^2) > t     (1)
then P ∈ P_P; otherwise P ∈ P_b.
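For illustration, a minimal NumPy sketch of this per-pixel comparison follows; the function name, array layout and vectorized form are assumptions for exposition, not part of the claimed method.

```python
import numpy as np

def classify_pixels(background: np.ndarray, target: np.ndarray, t: float) -> np.ndarray:
    """Preliminary foreground/background split inside a pedestrian box (step 1.2).

    background, target: aligned HxWx3 RGB crops of the same rectangular frame,
    from the pedestrian-free background picture and the picture containing
    the pedestrian. Returns a boolean mask, True = provisionally pedestrian (P_P).
    """
    diff = background.astype(np.float32) - target.astype(np.float32)
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # per-pixel Euclidean distance in RGB
    return dist > t                           # large colour difference -> pedestrian
```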
Step 1.3. The preliminarily processed image contains several discontinuities: some pixel points that clearly belong to the foreground (i.e., pedestrian) are judged as background because their color is close to the background picture, while some edge parts are mistakenly judged as pedestrian. Obviously, the attribution of a pixel is related to its relative position inside the detection frame; for example, the central part almost certainly belongs to the pedestrian. Therefore, a two-dimensional Gaussian distribution density function is introduced to estimate the probability that a given position belongs to the foreground (pedestrian), and a loss function is constructed by combining it with the Euclidean-distance measure of foreground-background difference.
A two-dimensional normal distribution is introduced to estimate the probability density of a point belonging to the pedestrian, so that pixel points closer to the edge are less likely to be judged as pedestrian, while pixel points closer to the center are more likely to be judged as pedestrian.
the method specifically comprises the following steps:
f(i, j) = 1 / (2π·σ_w·σ_h) · exp( −( (i − x)^2 / (2σ_w^2) + (j − y)^2 / (2σ_h^2) ) )
where i and j are the coordinates of the pixel point P, x and y are the coordinates of the center point of the rectangular frame in step 1.1, and σ_w and σ_h are the standard deviations determined by the width and height of the rectangular frame. In addition to the difference from the background and the influence of the relative position distribution on the attribution of pixel points, since the shapes of the various parts of pedestrians do not differ substantially, a data set with pedestrian mask maps accurate to the pixel level can be used; in the experiment the Daimler data set was used (the calibrated data set contains binary mask maps in which the pedestrian part is white and the background black). After all pictures in the data set are uniformly scaled to the size of the background image, the ratio of the number of times each position is judged as pedestrian across the different pictures to the total number of pictures in the data set is counted, giving the frequency W_(i,j) with which each pixel point belongs to a pedestrian. From this the loss function can be determined:
loss(i, j) = D(i, j) / D_max + α · f(i, j) · W_(i,j)
where the first term is the Euclidean distance between a pixel point and the corresponding background-picture pixel, normalized by D_max: D denotes the Euclidean distance of the pixel point computed by formula (1), and D_max denotes the maximum such distance over all pixel points. α is an empirical coefficient that determines the weight, i.e., the relative contributions of the pixel point's distance from the center and its difference from the background picture to the loss function. α differs under different background environments; to determine the coefficient, the Daimler pedestrian data set, which provides pedestrian mask maps accurate to the pixel level, was used: α was swept from 0 to 10 with a step of 0.01, the average accuracy of the per-pixel judgments was compared, and α was finally determined to be 2.20.
Step 1.4. With pedestrian labels accurate to the pixel level, the proportion of area occupied by pedestrians inside a labeled rectangular frame can easily be calculated (this merely illustrates the idea; different data sets or self-labeled data can be used in practice). Denote the proportion of pedestrian pixel points in the rectangular frame as μ%. The loss function value of each pixel point is calculated in turn and sorted in descending order; the first μ% of pixel points are judged as foreground (pedestrian) and the remaining pixel points as background.
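The following is a minimal sketch of steps 1.3-1.4 under stated assumptions: the exact combination of the terms is given only as an image in the original, so the loss below (normalized distance term plus an α-weighted position term, with the Gaussian normalizing constant absorbed into α) is a reconstruction, σ_w and σ_h are taken as half the box dimensions, and all names are illustrative.

```python
import numpy as np

def foreground_by_loss(dist, W_freq, alpha=2.20, mu=0.35):
    """Rank pixels of one box by the combined loss and keep the top mu fraction.

    dist   : HxW Euclidean distances from step 1.2
    W_freq : HxW empirical per-position pedestrian frequency (e.g. from Daimler masks)
    alpha  : empirical weight (the description reports 2.20)
    mu     : assumed fraction of box pixels belonging to the pedestrian
    """
    h, w = dist.shape
    jj, ii = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")  # row, col grids
    x, y = w / 2.0, h / 2.0                  # box centre
    sw, sh = w / 2.0, h / 2.0                # stand-ins for sigma_w, sigma_h
    gauss = np.exp(-(((ii - x) ** 2) / (2 * sw ** 2) + ((jj - y) ** 2) / (2 * sh ** 2)))
    loss = dist / dist.max() + alpha * gauss * W_freq   # reconstructed combination
    k = max(1, int(mu * h * w))
    thresh = np.partition(loss.ravel(), -k)[-k]         # k-th largest loss value
    return loss >= thresh                    # top mu% of loss values -> foreground
```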
Step 1.5. To make the pixel regions judged as "pedestrian" (foreground) continuous: if a pixel point is erroneously judged as "background" while all surrounding pixel points are judged as "pedestrian", and the point is not at an edge position, the point is also judged as "pedestrian". Specifically, on the test data set we applied an erosion operation with a 3x3 kernel to the central region of the image. In practice, the kernel size, the number of erosion iterations and the size of the central region are adjusted according to the actual image size and noise level.
If all points around a certain point are judged as pedestrian, the point does not belong to the edge area, and yet the point is judged as background, the point must be corrected to foreground. That is, we want to eliminate pixel "islands" (background pixels whose surroundings are all judged as foreground). Specifically, a region that is entirely surrounded by the foreground region and whose area ratio is less than 2% is also treated as foreground, thereby obtaining the final foreground and background regions.
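One possible reading of this correction, sketched with OpenCV connected components; the 2% threshold follows the text, while treating "surrounded" as "not touching the crop border" and the single erosion pass are assumptions.

```python
import cv2
import numpy as np

def fill_pixel_islands(mask: np.ndarray, max_ratio: float = 0.02) -> np.ndarray:
    """Flip small background 'islands' enclosed by foreground to foreground (step 1.5).

    mask: uint8 binary crop of one rectangular frame, 255 = foreground, 0 = background.
    """
    # optional 3x3 erosion pass over the crop, as described above
    eroded = cv2.erode(mask, np.ones((3, 3), np.uint8))
    inv = cv2.bitwise_not(eroded)                    # background components become white
    n, labels, stats, _ = cv2.connectedComponentsWithStats(inv, connectivity=4)
    h, w = mask.shape
    out = eroded.copy()
    for k in range(1, n):                            # label 0 covers the foreground side
        x, y, bw, bh, area = stats[k]
        touches_border = x == 0 or y == 0 or x + bw == w or y + bh == h
        if not touches_border and area < max_ratio * h * w:
            out[labels == k] = 255                   # enclosed small hole -> foreground
    return out
```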
Step 1.6. In some cases the colors of different body parts of a pedestrian differ greatly, and the foreground region judged as "pedestrian" may be completely split (for example into upper body and legs); the two regions then need to be recombined. On the other hand, rectangular frames of different pedestrians may overlap: owing to the scale difference caused by distance, the rectangular frame of a far-away pedestrian may lie entirely inside the detection frame of a nearby pedestrian, and in that case the two regions must not be merged. The pictures processed by the preceding steps consist of large continuous regions without small noise points; whether two foreground regions need to be merged is judged as follows:
Traverse the foreground regions within a rectangular frame pairwise and judge whether they can be merged: for t_i, t_j ∈ T, T denotes the set of foreground regions within the rectangular frame; rate is the average width-to-height ratio of pedestrian rectangular frames; x and y are the center coordinates of a region's circumscribed rectangle; t_k.width and t_k.height are the width and height of the circumscribed rectangle of the k-th foreground region; t_k.x and t_k.y are the center coordinates of the circumscribed rectangle of the k-th foreground region; W and H are the width and height of the rectangular frame; D_r and D_r' are thresholds that can be determined according to the specific situation.
If t_i and t_j both satisfy
Figure BDA0002440177060000081
And is
Figure BDA0002440177060000082
Figure BDA0002440177060000083
Then two areas are merged, and the specific merging mode is as follows: and connecting the lower left point of the circumscribed rectangle of the upper partial region with the upper left point of the circumscribed rectangle of the lower partial region, connecting the lower right point of the circumscribed rectangle of the upper partial region with the upper right point of the circumscribed rectangle of the lower partial region, and filling all the surrounded regions into the foreground (pedestrians).
That is, if the two regions are essentially in an up-down positional relationship, the left-right offset of their center points is small, neither region's own aspect ratio matches the scale proportion of a typical pedestrian, and the merged region does match the pedestrian scale proportion, then the two regions are merged; otherwise they are not merged and are still judged as two pedestrians.
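Since the merge inequalities are reproduced only as images in the original, the following sketch implements the verbal criteria of this paragraph under assumed thresholds standing in for D_r and D_r'; the region representation and all constants are illustrative, not the patent's formulas.

```python
def should_merge(ti: dict, tj: dict, rate: float, W: int,
                 dr: float = 0.2, dr2: float = 0.25) -> bool:
    """Decide whether two foreground regions inside one box belong to one pedestrian.

    Regions are dicts with centre coordinates x, y and width, height of the
    circumscribed rectangle; rate is the average pedestrian width-to-height
    ratio; W is the box width; dr, dr2 stand in for the thresholds D_r, D_r'.
    """
    def ratio(t):
        return t["width"] / t["height"]

    # essentially stacked vertically rather than side by side
    vertical = abs(ti["y"] - tj["y"]) > abs(ti["x"] - tj["x"])
    # small left-right offset of the centre points
    aligned = abs(ti["x"] - tj["x"]) < dr * W
    # neither part alone matches the pedestrian aspect ratio ...
    neither_fits = abs(ratio(ti) - rate) > dr2 and abs(ratio(tj) - rate) > dr2
    # ... but the merged bounding region does
    upper, lower = (ti, tj) if ti["y"] < tj["y"] else (tj, ti)
    merged_h = (lower["y"] + lower["height"] / 2) - (upper["y"] - upper["height"] / 2)
    merged_w = max(ti["width"], tj["width"])
    merged_fits = abs(merged_w / merged_h - rate) <= dr2
    return vertical and aligned and neither_fits and merged_fits
```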
Step 1.7. Binarization processing: all pixel points determined as pedestrian are set to pure white (R, G and B all 255) and the background area is set to pure black (R, G and B all 0), generating the mask map.
Step 2, training the saliency detection network. The network can separate pedestrians (foreground) from the background. The training process of the network model is to input the color picture and the mask map: the original pedestrian picture serves as the input of the saliency detection network and the mask map generated in the preceding steps serves as the training target (as shown in FIG. 3), so that the network "learns" the approximate shape of a pedestrian.
We used PiCANet (PiCANet: Learning Pixel-wise Contextual Attention for Saliency Detection) in the experiment. The structure of the network is similar to most semantic segmentation networks and is roughly divided into two parts, a CNN-based encoder-decoder architecture (see FIG. 3). The network generates an attention map for each pixel, where each attention weight corresponds to the contextual relevance of each object, and constructs global attention by selectively aggregating contextual information; this yields the "saliency" of each pixel point. We compared the results of background modeling with the results generated by the saliency detection network to verify the effectiveness of the saliency detection network (see FIG. 4).
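As a hedged sketch, the supervision described here reduces to the following generic training loop; `model` stands for any encoder-decoder saliency network such as PiCANet, and the data loader, loss and optimizer choices are assumptions, not the patent's prescription.

```python
import torch
import torch.nn as nn

def train_saliency(model: nn.Module, loader, epochs: int = 20, lr: float = 1e-4):
    """Train a saliency network on (color picture, step-1 mask) pairs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()           # per-pixel foreground/background loss
    model.train()
    for _ in range(epochs):
        for img, mask in loader:           # img: Nx3xHxW, mask: Nx1xHxW in {0, 1}
            opt.zero_grad()
            loss = bce(model(img), mask)   # predict saliency, supervise with the mask
            loss.backward()
            opt.step()
```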
Step 3, the saliency map is converted into a pseudo color map (as shown in FIG. 5). A thermodynamic map (heatmap) usually refers to a picture, taken by a thermal imaging camera, that reflects the temperature of the photographed object; the pseudo "heat map", also called a pseudo color map, refers here to a mapping from a grayscale map (the saliency map) to a color map.
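A minimal sketch of this grey-to-colour mapping with OpenCV; the JET colormap and the file paths are assumptions (the text specifies only a grayscale-to-color mapping), and the resize anticipates the 300x300 detector input described below.

```python
import cv2

saliency = cv2.imread("saliency.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path
pseudo = cv2.applyColorMap(saliency, cv2.COLORMAP_JET)       # grey -> pseudo colour
pseudo = cv2.resize(pseudo, (300, 300))                      # match detector input size
cv2.imwrite("pseudo_color.png", pseudo)
```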
Step 4, establishing a detection network comprising two completely symmetrical subnets A and B and a final target detection layer; training the detection network by inputting the pictures containing pedestrians and the corresponding pseudo color maps into subnets A and B respectively, feeding the result maps obtained by subnets A and B into the target detection layer, and stopping training when the loss value tends to a constant.
During model training, the input of subnet A is the pre-labeled color picture containing pedestrians, and the input of subnet B is the corresponding pseudo "heat map". Since generating a pseudo "heat map" from a color picture with the saliency detection network takes a long time, in order to guarantee real-time performance without losing too much detection accuracy, the inputs of both subnets at detection time are the original color detection picture containing pedestrians.
Specifically, subnets A and B contain parallel RFB modules and BasicRFB_a modules; the specific processing is as follows.
for each sub-detection network A and B, the feature map of the image is extracted by the first half part by using a pre-training model VGG16 detection network on I L SVRC C L S-L OC, specifically, the feature map conv4_2 obtained by twice convolution of the layer 4 of VGG16 is sent to a BasicRFB _ a module and then sent to a detection layer, the feature map cov7_2 obtained by twice convolution of the layer 7 of VGG16 is sent to an RFB module, and the output result is also sent to the detection layer.
In the latter half of sub-detection networks A and B, two detection networks with the same structure perform further convolution processing, and the results are finally sent to the target detection layer. The target detection layer (detection out layer) is consistent with the SSD (Single Shot MultiBox Detector) structure: it integrates the three results of prior boxes, prior-box offsets and scores, and finally outputs the qualifying target detection boxes together with each target's score and category, of which there are only two here, pedestrian and background. Note that the input of the network is a pair of images (the pseudo color picture and the original detection picture); owing to structural limitations, the pseudo color picture output by PiCANet has a fixed size of 224x224, while the input size of the detection network is 300x300, so the original color picture and the pseudo color picture must both be uniformly resized to 300x300 before being input into the network. The RFB at the front end is a multi-branch convolution block that obtains receptive fields of different scales by using convolution kernels of different scales and uniformly pools the generated branches. The specific structures of the RFB and BasicRFB_a modules are as follows (see FIG. 6):
for the RFB module, firstly, 1x1 convolution is used for reducing the number of channels of the feature map, and the structure is three branches;
(1) continuing to perform 3x3 hole convolution, wherein the hole span is 1;
(2) performing 1x1 convolution, and then performing 3x3 hole convolution, wherein the hole span is 3;
(3) performing 5x5 convolution, and then performing 3x3 hole convolution, wherein the hole span is 5;
and the 3 branches are spliced firstly, then spliced with the result of the feature diagram extracted by the VGG16 after batch normalization (Batchnorm), and then sent to the Relu activation function layer.
For the basicrrfb-a module, consistent with the RFB module, the number of channels of the profile is first reduced using 1x1 convolution, followed by a configuration of 4 branches.
(1) Continuing to perform 3x3 hole convolution, wherein the hole span is 1;
(2) performing 1x3 convolution, and then performing 3x3 hole convolution, wherein the hole span is 3;
(3) performing 3x1 convolution, and then performing 3x3 hole convolution, wherein the hole span is 3;
(4) performing a 3x3 convolution, followed by a 3x3 hole convolution with a hole span of 5;
and 4, splicing the 4 branches, splicing the spliced result with the result of the feature diagram extracted by the VGG16 after batch normalization (Batchnorm), and then sending the result into a Relu activation function layer.
Step 4.1. First, data enhancement (flipping, symmetry, etc.) is applied to the training set of color pictures containing pedestrians and pseudo color pictures. When training the detection network model, since our detection network model is trained from scratch, the learning rate is set to 0.01 in the warm-up phase (the first ten epochs) and then reduced to 0.00001 to make the model converge faster. The loss function is consistent with that of the original SSD detection network; training stops when the loss value essentially no longer changes.
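The schedule just described amounts to a two-step rule; a minimal sketch, assuming a standard PyTorch optimizer supplied by the surrounding training script:

```python
import torch

def adjust_lr(optimizer: torch.optim.Optimizer, epoch: int):
    """0.01 during the first ten warm-up epochs, then 0.00001 (as described above)."""
    lr = 0.01 if epoch < 10 else 0.00001
    for group in optimizer.param_groups:
        group["lr"] = lr
```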
Step 4.2. The trained detection network model was tested on the test set; compared with the existing RFB-Net, the detection network model provided by the invention greatly reduces the miss rate (on the same data set, from 50.9% for RFB-Net to 16.2%).
The described examples are intended to be illustrative, not limiting. Therefore, the present invention includes, but is not limited to, the examples described in the detailed description, and all other embodiments that can be derived from the technical solutions of the present invention by those skilled in the art also belong to the protection scope of the present invention.

Claims (7)

1. A real-time pedestrian detection method using background modeling enhanced data is characterized in that: comprises the following steps of (a) carrying out,
step 1, performing background modeling by using a background picture of monitoring data and a picture containing pedestrians, and performing binarization processing on the picture containing the pedestrians to generate a mask map;
step 2, inputting the mask map and the picture containing pedestrians, and generating a saliency map using a deep saliency detection network;
step 3, converting the saliency map into a pseudo color map;
step 4, establishing a detection network comprising two completely symmetrical sub-networks A and B and a final target detection layer; training the detection network by inputting the pictures containing pedestrians and the corresponding pseudo color maps into sub-networks A and B respectively, feeding the result maps obtained by sub-networks A and B into the target detection layer, and stopping training when the loss value tends to a constant; the sub-networks A and B comprise parallel RFB modules and BasicRFB_a modules, and the specific processing is as follows,
firstly, feature maps of the original color picture and the corresponding pseudo color picture are extracted by a pre-trained VGG16 model: the feature map conv4_2, obtained by two convolutions in layer 4 of VGG16, is sent to the BasicRFB_a module and its output is sent to the target detection layer; the feature map conv7_2, obtained by two convolutions in layer 7 of VGG16, is sent to the RFB module and its output is also sent to the target detection layer;
for the RFB module, a 1x1 convolution is first used to reduce the number of channels of the feature map, followed by three branches:
(1) a 3x3 dilated convolution with dilation rate 1;
(2) a 1x1 convolution, then a 3x3 dilated convolution with dilation rate 3;
(3) a 5x5 convolution, then a 3x3 dilated convolution with dilation rate 5;
the three branches are concatenated, the result is concatenated with the batch-normalized feature map conv7_2 extracted by VGG16, and the result is sent to a ReLU activation layer;
for the BasicRFB_a module, a 1x1 convolution is first used to reduce the number of channels of the feature map, followed by four branches:
(1a) a 3x3 dilated convolution with dilation rate 1;
(2a) a 1x3 convolution, then a 3x3 dilated convolution with dilation rate 3;
(3a) a 3x1 convolution, then a 3x3 dilated convolution with dilation rate 3;
(4a) a 3x3 convolution, then a 3x3 dilated convolution with dilation rate 5;
the four branches are concatenated, the result is concatenated with the batch-normalized feature map conv4_2 extracted by VGG16, and the result is sent to a ReLU activation layer;
and 5, detecting the pedestrian picture to be detected by using the trained detection network.
2. The method of real-time pedestrian detection using background modeling enhanced data according to claim 1, wherein: the specific implementation of step 1 is as follows,
step 1.1, preparing data: using a picture containing pedestrians and a surveillance background picture taken under the same illumination conditions but containing no pedestrians, and marking the parts containing pedestrians with rectangular frames, wherein the picture containing pedestrians and the background picture have the same size;
step 1.2, comparing the background picture without pedestrians and the target picture with pedestrians pixel by pixel in a rectangular frame area with pedestrians, calculating Euclidean distance, and preliminarily judging whether each pixel point belongs to the pedestrians or the background according to the Euclidean distance;
step 1.3, a two-dimensional Gaussian distribution density function is introduced to estimate the probability that the position of a certain pixel point is a pedestrian, and a loss function is constructed by combining the Euclidean distance in the step 1.2;
step 1.4, recording the proportion of the pedestrian pixel points in the rectangular frame as mu%, sequentially calculating the loss function value of each pixel point, sequencing in a descending order, taking the first mu% of the pixel points to judge as the foreground, and judging other pixel points as the background;
step 1.5, correcting pixel points: if all points surrounding a certain point are judged as foreground, the point does not belong to an edge area, and yet the point itself is judged as background, the point is corrected to foreground, finally yielding the foreground area and the background area;
and step 1.6, performing binarization processing: all pixel points determined as the foreground area are set to pure white and all pixel points of the background area to pure black, generating the mask map.
3. The method of real-time pedestrian detection using background modeling enhanced data according to claim 2, wherein: the specific implementation of step 1.2 is as follows,
comparing the RGB channel values of the pixel points at corresponding positions of the two pictures: let the R, G and B channel values of a pixel point P in the background picture and in the picture containing pedestrians be R1, G1, B1 and R2, G2, B2 respectively; the pixel points are divided into two categories, those belonging to the pedestrian part P_P and those belonging to the background part P_b;
if
sqrt((R1 − R2)^2 + (G1 − G2)^2 + (B1 − B2)^2) > t
then P ∈ P_P; otherwise P ∈ P_b,
where t is the discrimination threshold.
4. The method of real-time pedestrian detection using background modeling enhanced data according to claim 3, wherein: the specific implementation manner of introducing the two-dimensional gaussian distribution density function to estimate the probability that the position of a certain pixel point is a pedestrian in step 1.3 is as follows,
f(i, j) = 1 / (2π·σ_w·σ_h) · exp( −( (i − x)^2 / (2σ_w^2) + (j − y)^2 / (2σ_h^2) ) )
where i and j are the coordinates of the pixel point, x and y are the coordinates of the center point of the rectangular frame, and σ_w and σ_h are the standard deviations determined by the width and height of the rectangular frame.
5. The method of real-time pedestrian detection using background modeling enhanced data according to claim 4, wherein: step 1.3 the overall loss function is expressed as,
loss(i, j) = D(i, j) / D_max + α · f(i, j) · W_(i,j)
where the first term is the Euclidean distance between a pixel point and the corresponding background-picture pixel, normalized by D_max, the maximum such distance over all pixel points, and α is an empirical coefficient that determines the weight, i.e., the relative contributions of the pixel point's distance from the center and its difference from the background picture to the loss function; after the pictures of the Daimler data set are uniformly scaled to the size of the background picture, the ratio of the number of times each position is judged as pedestrian across the different pictures to the total number of pictures in the Daimler data set is counted, giving the frequency W_(i,j) with which each pixel point belongs to a pedestrian.
6. The method of real-time pedestrian detection using background modeling enhanced data according to claim 2, wherein: in step 1.5, a plurality of foreground regions exist under the same rectangular frame, and in order to avoid the situation that pedestrians are cracked, whether the plurality of foreground regions marked under the same rectangular frame need to be combined or not is judged before binary output processing is carried out, and the specific implementation mode is as follows,
performing pairwise traversal over the foreground regions in a rectangular frame to judge whether to merge them: for t_i, t_j ∈ T, T represents the set of foreground regions in the rectangular frame, rate is the average width-to-height ratio of pedestrian rectangular frames, x and y are the center coordinates of a region's circumscribed rectangle, t_k.width and t_k.height are the width and height of the circumscribed rectangle of the k-th foreground region, t_k.x and t_k.y are the center coordinates of the circumscribed rectangle of the k-th foreground region, W and H are the width and height of the rectangular frame, and D_r and D_r' are judgment thresholds; if t_i and t_j both satisfy
Figure FDA0002440177050000041
And is
Figure FDA0002440177050000042
Figure FDA0002440177050000043
Then two areas are merged, and the specific merging mode is as follows: and connecting a left lower point of the circumscribed rectangle of the upper partial region with a left upper point of the circumscribed rectangle of the lower partial region, connecting a right lower point of the circumscribed rectangle of the upper partial region with a right upper point of the circumscribed rectangle of the lower partial region, and completely filling the surrounded region into a foreground region.
7. The method of real-time pedestrian detection using background modeling enhanced data according to claim 1, wherein: the significance detection network in the step 2 adopts PICA-net.
CN202010263248.2A 2020-04-07 2020-04-07 Real-time pedestrian detection method using background modeling to enhance data Active CN111461036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010263248.2A CN111461036B (en) 2020-04-07 2020-04-07 Real-time pedestrian detection method using background modeling to enhance data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010263248.2A CN111461036B (en) 2020-04-07 2020-04-07 Real-time pedestrian detection method using background modeling to enhance data

Publications (2)

Publication Number Publication Date
CN111461036A true CN111461036A (en) 2020-07-28
CN111461036B CN111461036B (en) 2022-07-05

Family

ID=71685893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010263248.2A Active CN111461036B (en) 2020-04-07 2020-04-07 Real-time pedestrian detection method using background modeling to enhance data

Country Status (1)

Country Link
CN (1) CN111461036B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084874A (en) * 2020-08-11 2020-12-15 深圳市优必选科技股份有限公司 Object detection method and device and terminal equipment
CN112308114A (en) * 2020-09-24 2021-02-02 赣州好朋友科技有限公司 Method and device for sorting scheelite and readable storage medium
CN112785582A (en) * 2021-01-29 2021-05-11 北京百度网讯科技有限公司 Training method and device for thermodynamic diagram generation model, electronic equipment and storage medium
CN112907616A (en) * 2021-04-27 2021-06-04 浙江大学 Pedestrian detection method based on thermal imaging background filtering

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136511A (en) * 2013-01-21 2013-06-05 信帧电子技术(北京)有限公司 Behavior detection method and behavior detection device
CN103530879A (en) * 2013-10-15 2014-01-22 无锡清华信息科学与技术国家实验室物联网技术中心 Pedestrian color extraction method under specific scene
CN103700114A (en) * 2012-09-27 2014-04-02 中国航天科工集团第二研究院二O七所 Complex background modeling method based on variable Gaussian mixture number
KR101518485B1 (en) * 2013-11-29 2015-05-11 김홍기 Intelligent object tracking system
CN105139368A (en) * 2015-08-12 2015-12-09 旗瀚科技股份有限公司 Hybrid tone mapping method for machine vision
CN105550678A (en) * 2016-02-03 2016-05-04 武汉大学 Human body motion feature extraction method based on global remarkable edge area

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103700114A (en) * 2012-09-27 2014-04-02 中国航天科工集团第二研究院二O七所 Complex background modeling method based on variable Gaussian mixture number
CN103136511A (en) * 2013-01-21 2013-06-05 信帧电子技术(北京)有限公司 Behavior detection method and behavior detection device
CN103530879A (en) * 2013-10-15 2014-01-22 无锡清华信息科学与技术国家实验室物联网技术中心 Pedestrian color extraction method under specific scene
KR101518485B1 (en) * 2013-11-29 2015-05-11 김홍기 Intelligent object tracking system
CN105139368A (en) * 2015-08-12 2015-12-09 旗瀚科技股份有限公司 Hybrid tone mapping method for machine vision
CN105550678A (en) * 2016-02-03 2016-05-04 武汉大学 Human body motion feature extraction method based on global remarkable edge area

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SARTHAK GUPTA等: "GPOL: Gradient and Probabilistic approach for Object Localization to understand the working of CNNs", 《2019 IEEE BOMBAY SECTION SIGNATURE CONFERENCE (IBSSC)》 *
SONGTAO LIU等: "Receptive Field Block Net for Accurate and Fast Object Detection", 《15TH EUROPEAN CONFERENCE ON COMPUTER VISION (ECCV)》 *
WANG Weifeng et al.: "Fast Small-Target Detection Algorithm Based on Receptive Fields", Laser & Optoelectronics Progress *
LI Ning et al.: "Pedestrian Detection Combining Semantic Features under a Visual Attention Mechanism", Journal of Image and Graphics *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084874A (en) * 2020-08-11 2020-12-15 深圳市优必选科技股份有限公司 Object detection method and device and terminal equipment
CN112084874B (en) * 2020-08-11 2023-12-29 深圳市优必选科技股份有限公司 Object detection method and device and terminal equipment
CN112308114A (en) * 2020-09-24 2021-02-02 赣州好朋友科技有限公司 Method and device for sorting scheelite and readable storage medium
CN112785582A (en) * 2021-01-29 2021-05-11 北京百度网讯科技有限公司 Training method and device for thermodynamic diagram generation model, electronic equipment and storage medium
CN112785582B (en) * 2021-01-29 2024-03-22 北京百度网讯科技有限公司 Training method and device for thermodynamic diagram generation model, electronic equipment and storage medium
CN112907616A (en) * 2021-04-27 2021-06-04 浙江大学 Pedestrian detection method based on thermal imaging background filtering
CN112907616B (en) * 2021-04-27 2022-05-03 浙江大学 Pedestrian detection method based on thermal imaging background filtering

Also Published As

Publication number Publication date
CN111461036B (en) 2022-07-05

Similar Documents

Publication Publication Date Title
CN111461036B (en) Real-time pedestrian detection method using background modeling to enhance data
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN111612763B (en) Mobile phone screen defect detection method, device and system, computer equipment and medium
CN113160192B (en) Visual sense-based snow pressing vehicle appearance defect detection method and device under complex background
CN111640157B (en) Checkerboard corner detection method based on neural network and application thereof
CN105631880B (en) Lane line dividing method and device
CN110276264B (en) Crowd density estimation method based on foreground segmentation graph
CN112241699A (en) Object defect category identification method and device, computer equipment and storage medium
CN108197604A (en) Fast face positioning and tracing method based on embedded device
CN111008632B (en) License plate character segmentation method based on deep learning
CN106355607B (en) A kind of width baseline color image template matching method
CN111310756A (en) Damaged corn particle detection and classification method based on deep learning
CN112561899A (en) Electric power inspection image identification method
CN110866915A (en) Circular inkstone quality detection method based on metric learning
CN115272204A (en) Bearing surface scratch detection method based on machine vision
CN111127384A (en) Strong reflection workpiece vision measurement method based on polarization imaging
Liu et al. D-CenterNet: An anchor-free detector with knowledge distillation for industrial defect detection
CN114445661B (en) Embedded image recognition method based on edge calculation
US11588955B2 (en) Apparatus, method, and computer program for image conversion
CN109815957A (en) A kind of character recognition method based on color image under complex background
TW202319959A (en) Image recognition system and training method thereof
CN112304512A (en) Multi-workpiece scene air tightness detection method and system based on artificial intelligence
CN112949510A (en) Human detection method based on fast R-CNN thermal infrared image
CN111062384B (en) Vehicle window accurate positioning method based on deep learning
US20240005469A1 (en) Defect detection method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant