CN111461036A - Real-time pedestrian detection method using background modeling enhanced data - Google Patents


Info

Publication number
CN111461036A
CN111461036A
Authority
CN
China
Prior art keywords
background
picture
pedestrians
convolution
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010263248.2A
Other languages
Chinese (zh)
Other versions
CN111461036B (en)
Inventor
梁超
张圆通
王晓
吴佳乐
伍谦
白云鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202010263248.2A priority Critical patent/CN111461036B/en
Publication of CN111461036A publication Critical patent/CN111461036A/en
Application granted granted Critical
Publication of CN111461036B publication Critical patent/CN111461036B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Abstract

The invention discloses a real-time pedestrian detection method using background-modeling-enhanced data. First, background modeling is performed using a background picture from the monitoring data and a picture containing pedestrians, and the picture containing pedestrians is binarized to generate a mask map. The mask map and the picture containing pedestrians are then input to a deep saliency detection network to generate a saliency map, which is converted into a pseudo color map. Finally, a detection network is established, comprising two completely symmetrical subnets A and B and a final target detection layer; during training, pictures containing pedestrians and the corresponding pseudo color maps are input to subnets A and B respectively, the result maps obtained by subnets A and B are fed to the target detection layer, and training stops when the loss value tends to a constant. The trained detection network is then used to detect the pedestrian pictures to be detected.

Description

Real-time pedestrian detection method using background modeling enhanced data
Technical Field
The invention belongs to the field of computer vision, in particular to the field of pedestrian detection; it mainly uses background modeling and saliency detection techniques to enhance data and finally trains a real-time pedestrian detection model.
Background
Pedestrian detection is a very important area of computer vision and the underlying technology of many intelligent devices. It has great application value in security, intelligent transportation, autonomous driving, trajectory tracking and other fields.
In actual detection, most mainstream pedestrian detection algorithms are optimized for the specific detection object, namely pedestrians, on the basis of a general object detector. Traditional detection methods mainly use machine learning algorithms such as ICF, LDCF and the like, but these machine-learning-based detection algorithms have clear drawbacks: the detection speed is slow, the miss rate is high, and the robustness of the model is poor, so they are difficult to use in an actual production environment. In the past decade, detection algorithms based on deep networks have steadily improved detection accuracy and speed, and detection models tailored to a specific object, especially algorithms designed for detecting pedestrians, have further improved the effectiveness of pedestrian detection.
Although some methods can considerably improve the detection capability of a model by using multispectral data, training such a model requires a large number of color pictures with corresponding saliency maps or infrared images. Multispectral data are often difficult to obtain owing to equipment limitations, and the large number of pixel-accurate mask maps needed to obtain saliency maps must be labeled manually at great cost. Beyond these training-phase problems, such models also cannot achieve real-time detection in the detection phase.
Disclosure of Invention
The invention mainly addresses the problems of existing hybrid-network pedestrian detectors: the multispectral training sets required by existing image enhancement or hybrid network techniques are difficult to obtain; part of the data requires a large number of manually labeled mask maps; the trained models are often large and difficult to deploy; and real-time detection cannot be guaranteed. In view of these problems, the invention proposes a model that first obtains a saliency map using background modeling and a saliency detection network, and then inputs the processed saliency map together with the original detection picture into a hybrid network, thereby reducing the miss rate.
The technical scheme of the invention is a real-time pedestrian detection method using background modeling to enhance data, which comprises the following steps:
step 1, performing background modeling by using a background picture of monitoring data and a picture containing pedestrians, and performing binarization processing on the picture containing the pedestrians to generate a mask map;
step 2, inputting the mask map and the picture containing pedestrians, and generating a saliency map using a deep saliency detection network;
step 3, converting the saliency map into a pseudo color map;
step 4, establishing a detection network comprising two completely symmetrical subnets A and B and a final target detection layer; training the detection network by inputting the pictures containing pedestrians and the corresponding pseudo color maps into subnets A and B respectively, feeding the result maps obtained by subnets A and B into the target detection layer, and stopping training when the loss value tends to a constant; the subnets A and B comprise parallel RFB modules and BasicRFB_a modules, and the specific processing is as follows,
firstly, feature maps of the original color picture and the corresponding pseudo color picture are extracted by a pre-trained VGG16 model: the feature map conv4_2, obtained by two convolutions in layer 4 of VGG16, is sent to the BasicRFB_a module and its output is sent to the target detection layer; the feature map conv7_2, obtained by two convolutions in layer 7 of VGG16, is sent to the RFB module and its output is also sent to the target detection layer;
for the RFB module, a 1x1 convolution is first used to reduce the number of channels of the feature map, followed by three branches:
(1) a 3x3 dilated (hole) convolution with dilation rate 1;
(2) a 1x1 convolution, then a 3x3 dilated convolution with dilation rate 3;
(3) a 5x5 convolution, then a 3x3 dilated convolution with dilation rate 5;
the three branches are concatenated, the result is concatenated with the batch-normalized feature map conv7_2 extracted by VGG16, and the result is sent to a ReLU activation layer;
for the BasicRFB_a module, a 1x1 convolution is first used to reduce the number of channels of the feature map, followed by four branches:
(1a) a 3x3 dilated convolution with dilation rate 1;
(2a) a 1x3 convolution, then a 3x3 dilated convolution with dilation rate 3;
(3a) a 3x1 convolution, then a 3x3 dilated convolution with dilation rate 3;
(4a) a 3x3 convolution, then a 3x3 dilated convolution with dilation rate 5;
the four branches are concatenated, the result is concatenated with the batch-normalized feature map conv4_2 extracted by VGG16, and the result is sent to a ReLU activation layer;
and 5, detecting the pedestrian picture to be detected by using the trained detection network.
Further, the specific implementation manner of step 1 is as follows,
step 1.1, preparing data: using a picture containing pedestrians and a surveillance background picture taken at the same position under the same illumination conditions but containing no pedestrians, and marking the parts containing pedestrians with rectangular frames;
step 1.2, comparing the background picture without pedestrians and the target picture with pedestrians pixel by pixel in a rectangular frame area with pedestrians, calculating Euclidean distance, and preliminarily judging whether each pixel point belongs to the pedestrians or the background according to the Euclidean distance;
step 1.3, a two-dimensional Gaussian distribution density function is introduced to estimate the probability that the position of a certain pixel point is a pedestrian, and a loss function is constructed by combining the Euclidean distance in the step 1.2;
step 1.4, recording the proportion of the pedestrian pixel points in the rectangular frame as mu%, sequentially calculating the loss function value of each pixel point, sequencing in a descending order, taking the first mu% of the pixel points to judge as the foreground, and judging other pixel points as the background;
step 1.5, correcting pixel points: if all points surrounding a certain point are judged as foreground, the point does not belong to an edge area, and yet the point itself is judged as background, the point is corrected to foreground, finally yielding the foreground area and the background area;
and step 1.6, performing binarization processing: all pixel points determined as the foreground area are set to pure white and all pixel points of the background area to pure black, generating the mask map.
Further, the specific implementation manner of step 1.2 is as follows,
Comparing the RGB channel values of the pixel points at corresponding positions of the two pictures: let the R, G and B channel values of a pixel point P in the background picture and in the picture containing pedestrians be R1, G1, B1 and R2, G2, B2 respectively; the pixel points are divided into two categories, those belonging to the pedestrian part P_P and those belonging to the background part P_b.
If
sqrt((R1 − R2)^2 + (G1 − G2)^2 + (B1 − B2)^2) > t
then P ∈ P_P; otherwise P ∈ P_b,
where t is the discrimination threshold.
Further, the specific implementation manner of introducing the two-dimensional Gaussian distribution density function to estimate the probability that the position of a certain pixel point is a pedestrian in the step 1.3 is as follows,
f(i, j) = 1 / (2π·σ_w·σ_h) · exp( −( (i − x)^2 / (2σ_w^2) + (j − y)^2 / (2σ_h^2) ) )
where i and j are the coordinates of the pixel point, x and y are the coordinates of the center point of the rectangular frame, and σ_w and σ_h are the standard deviations determined by the width and height of the rectangular frame.
Further, the overall loss function of step 1.3 is expressed as,
loss(i, j) = D(i, j) / D_max + α · f(i, j) · W_(i,j)
where the first term is the Euclidean distance between a pixel point and the corresponding background-picture pixel, normalized by D_max, the maximum such distance over all pixel points, and α is an empirical coefficient that determines the weight, i.e., the relative contributions of the pixel point's distance from the center and its difference from the background picture to the loss function.
Further, in step 1.5, when multiple foreground regions exist under the same rectangular frame, in order to avoid splitting a pedestrian it is judged, before the preliminary binarization processing, whether the multiple foreground regions marked under the same rectangular frame need to be merged; the specific implementation is as follows,
traversing the foreground regions in a rectangular frame pairwise to judge whether to merge them: for t_i, t_j ∈ T, T represents the set of foreground regions in the rectangular frame, rate is the average width-to-height ratio of pedestrian rectangular frames, x and y are the center coordinates of a region's circumscribed rectangle, t_k.width and t_k.height are the width and height of the circumscribed rectangle of the k-th foreground region, t_k.x and t_k.y are the center coordinates of the circumscribed rectangle of the k-th foreground region, W and H are the width and height of the rectangular frame, and D_r and D_r' are judgment thresholds; if t_i and t_j both satisfy
Figure BDA0002440177060000051
And is
Figure BDA0002440177060000052
Figure BDA0002440177060000053
Then two areas are merged, and the specific merging mode is as follows: and connecting a left lower point of the circumscribed rectangle of the upper partial region with a left upper point of the circumscribed rectangle of the lower partial region, connecting a right lower point of the circumscribed rectangle of the upper partial region with a right upper point of the circumscribed rectangle of the lower partial region, and completely filling the surrounded region into a foreground region.
Further, the saliency detection network in step 2 adopts PiCANet.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) With comparable precision, the parameter count of the whole network is small; it is a lightweight detection network that is easy to port and deploy.
(2) Because the pseudo heat map input provides richer semantic information, the detection model maintains a low miss rate even under dim or poor lighting conditions.
(3) Compared with traditional background modeling, the method does not depend on a specific background environment and has stronger generality.
Drawings
FIG. 1 is a flow chart of the detection according to an embodiment of the present invention;
FIG. 2 is an illustration of a data set used in an embodiment of the present invention, wherein the left side is a picture without a pedestrian and the right side is a picture with a pedestrian;
FIG. 3 is a network architecture diagram of the saliency detection network;
fig. 4 is a comparison of background modeling and saliency detection network results, where 1 is the original picture, 2 is the picture preliminarily generated by background modeling, 3 is the saliency map generated by PiCANet, and 4 is the pixel-accurate ground truth;
FIG. 5 is a pseudo color map generated by an embodiment of the present invention;
fig. 6 is a network structure diagram of subnets a and B according to an embodiment of the present invention.
Detailed Description
The following describes the technical solution of the present invention in detail with reference to the flowchart (FIG. 1).
Step 1, performing background modeling by using a background picture and a picture containing pedestrians;
step 1.1, data set preparation. Generally speaking, in a training set for pedestrian recognition and object detection, pedestrians in a training picture are marked by using rectangular frames, and relative position information is input into a training network, and a picture containing pedestrians and a background picture without pedestrians at the same position and under the same illumination condition are shot by monitoring, wherein the size of the picture containing pedestrians is the same as that of the background picture, as shown in fig. 2.
Step 1.2, comparing the background picture without pedestrians and the target picture with pedestrians pixel by pixel in a rectangular frame area with pedestrians, calculating Euclidean distance, and preliminarily judging whether each pixel point belongs to the pedestrians or the background according to the Euclidean distance;
specifically, RGB channel values of pixel points at corresponding positions of every two pictures are compared, R, G and B channel values of a certain pixel point P of a background picture and a certain pixel point P of a picture containing pedestrians are respectively set as R1, G1, B1, R2, G2 and B2, the pixel points are divided into two types, and the pixel points belong to the part P of the pedestriansPAnd a part P belonging to the backgroundbAnd the distinguishing threshold is t (the threshold t has no fixed value and needs to be determined according to different environments of the image in actual operation)
If
sqrt((R1 − R2)^2 + (G1 − G2)^2 + (B1 − B2)^2) > t     (1)
then P ∈ P_P; otherwise P ∈ P_b.
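For illustration, a minimal NumPy sketch of this per-pixel comparison follows; the function name, array layout and vectorized form are assumptions for exposition, not part of the claimed method.

```python
import numpy as np

def classify_pixels(background: np.ndarray, target: np.ndarray, t: float) -> np.ndarray:
    """Preliminary foreground/background split inside a pedestrian box (step 1.2).

    background, target: aligned HxWx3 RGB crops of the same rectangular frame,
    from the pedestrian-free background picture and the picture containing
    the pedestrian. Returns a boolean mask, True = provisionally pedestrian (P_P).
    """
    diff = background.astype(np.float32) - target.astype(np.float32)
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # per-pixel Euclidean distance in RGB
    return dist > t                           # large colour difference -> pedestrian
```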
Step 1.3. The preliminarily processed image contains several discontinuities: some pixel points that clearly belong to the foreground (i.e., pedestrian) are judged as background because their color is close to the background picture, while some edge parts are mistakenly judged as pedestrian. Obviously, the attribution of a pixel is related to its relative position inside the detection frame; for example, the central part almost certainly belongs to the pedestrian. Therefore, a two-dimensional Gaussian distribution density function is introduced to estimate the probability that a given position belongs to the foreground (pedestrian), and a loss function is constructed by combining it with the Euclidean-distance measure of foreground-background difference.
A two-dimensional normal distribution is introduced to estimate the probability density of a point belonging to the pedestrian, so that pixel points closer to the edge are less likely to be judged as pedestrian, while pixel points closer to the center are more likely to be judged as pedestrian.
the method specifically comprises the following steps:
f(i, j) = 1 / (2π·σ_w·σ_h) · exp( −( (i − x)^2 / (2σ_w^2) + (j − y)^2 / (2σ_h^2) ) )
where i and j are the coordinates of the pixel point P, x and y are the coordinates of the center point of the rectangular frame in step 1.1, and σ_w and σ_h are the standard deviations determined by the width and height of the rectangular frame. In addition to the difference from the background and the influence of the relative position distribution on the attribution of pixel points, since the shapes of the various parts of pedestrians do not differ substantially, a data set with pedestrian mask maps accurate to the pixel level can be used; in the experiment the Daimler data set was used (the calibrated data set contains binary mask maps in which the pedestrian part is white and the background black). After all pictures in the data set are uniformly scaled to the size of the background image, the ratio of the number of times each position is judged as pedestrian across the different pictures to the total number of pictures in the data set is counted, giving the frequency W_(i,j) with which each pixel point belongs to a pedestrian. From this the loss function can be determined:
loss(i, j) = D(i, j) / D_max + α · f(i, j) · W_(i,j)
where the first term is the Euclidean distance between a pixel point and the corresponding background-picture pixel, normalized by D_max: D denotes the Euclidean distance of the pixel point computed by formula (1), and D_max denotes the maximum such distance over all pixel points. α is an empirical coefficient that determines the weight, i.e., the relative contributions of the pixel point's distance from the center and its difference from the background picture to the loss function. α differs under different background environments; to determine the coefficient, the Daimler pedestrian data set, which provides pedestrian mask maps accurate to the pixel level, was used: α was swept from 0 to 10 with a step of 0.01, the average accuracy of the per-pixel judgments was compared, and α was finally determined to be 2.20.
Step 1.4. With pedestrian labels accurate to the pixel level, the proportion of area occupied by pedestrians inside a labeled rectangular frame can easily be calculated (this merely illustrates the idea; different data sets or self-labeled data can be used in practice). Denote the proportion of pedestrian pixel points in the rectangular frame as μ%. The loss function value of each pixel point is calculated in turn and sorted in descending order; the first μ% of pixel points are judged as foreground (pedestrian) and the remaining pixel points as background.
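The following is a minimal sketch of steps 1.3-1.4 under stated assumptions: the exact combination of the terms is given only as an image in the original, so the loss below (normalized distance term plus an α-weighted position term, with the Gaussian normalizing constant absorbed into α) is a reconstruction, σ_w and σ_h are taken as half the box dimensions, and all names are illustrative.

```python
import numpy as np

def foreground_by_loss(dist, W_freq, alpha=2.20, mu=0.35):
    """Rank pixels of one box by the combined loss and keep the top mu fraction.

    dist   : HxW Euclidean distances from step 1.2
    W_freq : HxW empirical per-position pedestrian frequency (e.g. from Daimler masks)
    alpha  : empirical weight (the description reports 2.20)
    mu     : assumed fraction of box pixels belonging to the pedestrian
    """
    h, w = dist.shape
    jj, ii = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")  # row, col grids
    x, y = w / 2.0, h / 2.0                  # box centre
    sw, sh = w / 2.0, h / 2.0                # stand-ins for sigma_w, sigma_h
    gauss = np.exp(-(((ii - x) ** 2) / (2 * sw ** 2) + ((jj - y) ** 2) / (2 * sh ** 2)))
    loss = dist / dist.max() + alpha * gauss * W_freq   # reconstructed combination
    k = max(1, int(mu * h * w))
    thresh = np.partition(loss.ravel(), -k)[-k]         # k-th largest loss value
    return loss >= thresh                    # top mu% of loss values -> foreground
```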
Step 1.5. To make the pixel regions judged as "pedestrian" (foreground) continuous: if a pixel point is erroneously judged as "background" while all surrounding pixel points are judged as "pedestrian", and the point is not at an edge position, the point is also judged as "pedestrian". Specifically, on the test data set we applied an erosion operation with a 3x3 kernel to the central region of the image. In practice, the kernel size, the number of erosion iterations and the size of the central region are adjusted according to the actual image size and noise level.
If all points around a certain point are judged as pedestrian, the point does not belong to the edge area, and yet the point is judged as background, the point must be corrected to foreground. That is, we want to eliminate pixel "islands" (background pixels whose surroundings are all judged as foreground). Specifically, a region that is entirely surrounded by the foreground region and whose area ratio is less than 2% is also treated as foreground, thereby obtaining the final foreground and background regions.
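One possible reading of this correction, sketched with OpenCV connected components; the 2% threshold follows the text, while treating "surrounded" as "not touching the crop border" and the single erosion pass are assumptions.

```python
import cv2
import numpy as np

def fill_pixel_islands(mask: np.ndarray, max_ratio: float = 0.02) -> np.ndarray:
    """Flip small background 'islands' enclosed by foreground to foreground (step 1.5).

    mask: uint8 binary crop of one rectangular frame, 255 = foreground, 0 = background.
    """
    # optional 3x3 erosion pass over the crop, as described above
    eroded = cv2.erode(mask, np.ones((3, 3), np.uint8))
    inv = cv2.bitwise_not(eroded)                    # background components become white
    n, labels, stats, _ = cv2.connectedComponentsWithStats(inv, connectivity=4)
    h, w = mask.shape
    out = eroded.copy()
    for k in range(1, n):                            # label 0 covers the foreground side
        x, y, bw, bh, area = stats[k]
        touches_border = x == 0 or y == 0 or x + bw == w or y + bh == h
        if not touches_border and area < max_ratio * h * w:
            out[labels == k] = 255                   # enclosed small hole -> foreground
    return out
```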
Step 1.6. In some cases the colors of different body parts of a pedestrian differ greatly, and the foreground region judged as "pedestrian" may be completely split (for example into upper body and legs); the two regions then need to be recombined. On the other hand, rectangular frames of different pedestrians may overlap: owing to the scale difference caused by distance, the rectangular frame of a far-away pedestrian may lie entirely inside the detection frame of a nearby pedestrian, and in that case the two regions must not be merged. The pictures processed by the preceding steps consist of large continuous regions without small noise points; whether two foreground regions need to be merged is judged as follows:
Traverse the foreground regions within a rectangular frame pairwise and judge whether they can be merged: for t_i, t_j ∈ T, T denotes the set of foreground regions within the rectangular frame; rate is the average width-to-height ratio of pedestrian rectangular frames; x and y are the center coordinates of a region's circumscribed rectangle; t_k.width and t_k.height are the width and height of the circumscribed rectangle of the k-th foreground region; t_k.x and t_k.y are the center coordinates of the circumscribed rectangle of the k-th foreground region; W and H are the width and height of the rectangular frame; D_r and D_r' are thresholds that can be determined according to the specific situation.
If t_i and t_j both satisfy
Figure BDA0002440177060000081
And is
Figure BDA0002440177060000082
Figure BDA0002440177060000083
Then two areas are merged, and the specific merging mode is as follows: and connecting the lower left point of the circumscribed rectangle of the upper partial region with the upper left point of the circumscribed rectangle of the lower partial region, connecting the lower right point of the circumscribed rectangle of the upper partial region with the upper right point of the circumscribed rectangle of the lower partial region, and filling all the surrounded regions into the foreground (pedestrians).
That is, if the two regions are essentially in an up-down positional relationship, the left-right offset of their center points is small, neither region's own aspect ratio matches the scale proportion of a typical pedestrian, and the merged region does match the pedestrian scale proportion, then the two regions are merged; otherwise they are not merged and are still judged as two pedestrians.
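Since the merge inequalities are reproduced only as images in the original, the following sketch implements the verbal criteria of this paragraph under assumed thresholds standing in for D_r and D_r'; the region representation and all constants are illustrative, not the patent's formulas.

```python
def should_merge(ti: dict, tj: dict, rate: float, W: int,
                 dr: float = 0.2, dr2: float = 0.25) -> bool:
    """Decide whether two foreground regions inside one box belong to one pedestrian.

    Regions are dicts with centre coordinates x, y and width, height of the
    circumscribed rectangle; rate is the average pedestrian width-to-height
    ratio; W is the box width; dr, dr2 stand in for the thresholds D_r, D_r'.
    """
    def ratio(t):
        return t["width"] / t["height"]

    # essentially stacked vertically rather than side by side
    vertical = abs(ti["y"] - tj["y"]) > abs(ti["x"] - tj["x"])
    # small left-right offset of the centre points
    aligned = abs(ti["x"] - tj["x"]) < dr * W
    # neither part alone matches the pedestrian aspect ratio ...
    neither_fits = abs(ratio(ti) - rate) > dr2 and abs(ratio(tj) - rate) > dr2
    # ... but the merged bounding region does
    upper, lower = (ti, tj) if ti["y"] < tj["y"] else (tj, ti)
    merged_h = (lower["y"] + lower["height"] / 2) - (upper["y"] - upper["height"] / 2)
    merged_w = max(ti["width"], tj["width"])
    merged_fits = abs(merged_w / merged_h - rate) <= dr2
    return vertical and aligned and neither_fits and merged_fits
```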
Step 1.7. Binarization processing: all pixel points determined as pedestrian are set to pure white (R, G and B all 255) and the background area is set to pure black (R, G and B all 0), generating the mask map.
Step 2, training the saliency detection network. The network can separate pedestrians (foreground) from the background. The training process of the network model is to input the color picture and the mask map: the original pedestrian picture serves as the input of the saliency detection network and the mask map generated in the preceding steps serves as the training target (as shown in FIG. 3), so that the network "learns" the approximate shape of a pedestrian.
We used PiCANet (PiCANet: Learning Pixel-wise Contextual Attention for Saliency Detection) in the experiment. The structure of the network is similar to most semantic segmentation networks and is roughly divided into two parts, a CNN-based encoder-decoder architecture (see FIG. 3). The network generates an attention map for each pixel, where each attention weight corresponds to the contextual relevance of each object, and constructs global attention by selectively aggregating contextual information; this yields the "saliency" of each pixel point. We compared the results of background modeling with the results generated by the saliency detection network to verify the effectiveness of the saliency detection network (see FIG. 4).
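As a hedged sketch, the supervision described here reduces to the following generic training loop; `model` stands for any encoder-decoder saliency network such as PiCANet, and the data loader, loss and optimizer choices are assumptions, not the patent's prescription.

```python
import torch
import torch.nn as nn

def train_saliency(model: nn.Module, loader, epochs: int = 20, lr: float = 1e-4):
    """Train a saliency network on (color picture, step-1 mask) pairs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()           # per-pixel foreground/background loss
    model.train()
    for _ in range(epochs):
        for img, mask in loader:           # img: Nx3xHxW, mask: Nx1xHxW in {0, 1}
            opt.zero_grad()
            loss = bce(model(img), mask)   # predict saliency, supervise with the mask
            loss.backward()
            opt.step()
```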
Step 3, the saliency map is converted into a pseudo color map (as shown in FIG. 5). A thermodynamic map (heatmap) usually refers to a picture, taken by a thermal imaging camera, that reflects the temperature of the photographed object; the pseudo "heat map", also called a pseudo color map, refers here to a mapping from a grayscale map (the saliency map) to a color map.
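A minimal sketch of this grey-to-colour mapping with OpenCV; the JET colormap and the file paths are assumptions (the text specifies only a grayscale-to-color mapping), and the resize anticipates the 300x300 detector input described below.

```python
import cv2

saliency = cv2.imread("saliency.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path
pseudo = cv2.applyColorMap(saliency, cv2.COLORMAP_JET)       # grey -> pseudo colour
pseudo = cv2.resize(pseudo, (300, 300))                      # match detector input size
cv2.imwrite("pseudo_color.png", pseudo)
```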
Step 4, establishing a detection network comprising two completely symmetrical subnets A and B and a final target detection layer; training the detection network by inputting the pictures containing pedestrians and the corresponding pseudo color maps into subnets A and B respectively, feeding the result maps obtained by subnets A and B into the target detection layer, and stopping training when the loss value tends to a constant.
During model training, the input of subnet A is the pre-labeled color picture containing pedestrians, and the input of subnet B is the corresponding pseudo "heat map". Since generating a pseudo "heat map" from a color picture with the saliency detection network takes a long time, in order to guarantee real-time performance without losing too much detection accuracy, the inputs of both subnets at detection time are the original color detection picture containing pedestrians.
Specifically, subnets A and B contain parallel RFB modules and BasicRFB_a modules; the specific processing is as follows.
for each sub-detection network A and B, the feature map of the image is extracted by the first half part by using a pre-training model VGG16 detection network on I L SVRC C L S-L OC, specifically, the feature map conv4_2 obtained by twice convolution of the layer 4 of VGG16 is sent to a BasicRFB _ a module and then sent to a detection layer, the feature map cov7_2 obtained by twice convolution of the layer 7 of VGG16 is sent to an RFB module, and the output result is also sent to the detection layer.
In the latter half of sub-detection networks A and B, two detection networks with the same structure perform further convolution processing, and the results are finally sent to the target detection layer. The target detection layer (detection out layer) is consistent with the SSD (Single Shot MultiBox Detector) structure: it integrates the three results of prior boxes, prior-box offsets and scores, and finally outputs the qualifying target detection boxes together with each target's score and category, of which there are only two here, pedestrian and background. Note that the input of the network is a pair of images (the pseudo color picture and the original detection picture); owing to structural limitations, the pseudo color picture output by PiCANet has a fixed size of 224x224, while the input size of the detection network is 300x300, so the original color picture and the pseudo color picture must both be uniformly resized to 300x300 before being input into the network. The RFB at the front end is a multi-branch convolution block that obtains receptive fields of different scales by using convolution kernels of different scales and uniformly pools the generated branches. The specific structures of the RFB and BasicRFB_a modules are as follows (see FIG. 6):
for the RFB module, firstly, 1x1 convolution is used for reducing the number of channels of the feature map, and the structure is three branches;
(1) continuing to perform 3x3 hole convolution, wherein the hole span is 1;
(2) performing 1x1 convolution, and then performing 3x3 hole convolution, wherein the hole span is 3;
(3) performing 5x5 convolution, and then performing 3x3 hole convolution, wherein the hole span is 5;
and the 3 branches are spliced firstly, then spliced with the result of the feature diagram extracted by the VGG16 after batch normalization (Batchnorm), and then sent to the Relu activation function layer.
For the basicrrfb-a module, consistent with the RFB module, the number of channels of the profile is first reduced using 1x1 convolution, followed by a configuration of 4 branches.
(1) Continuing to perform 3x3 hole convolution, wherein the hole span is 1;
(2) performing 1x3 convolution, and then performing 3x3 hole convolution, wherein the hole span is 3;
(3) performing 3x1 convolution, and then performing 3x3 hole convolution, wherein the hole span is 3;
(4) performing a 3x3 convolution, followed by a 3x3 hole convolution with a hole span of 5;
and 4, splicing the 4 branches, splicing the spliced result with the result of the feature diagram extracted by the VGG16 after batch normalization (Batchnorm), and then sending the result into a Relu activation function layer.
Step 4.1. First, data enhancement (flipping, symmetry, etc.) is applied to the training set of color pictures containing pedestrians and pseudo color pictures. When training the detection network model, since our detection network model is trained from scratch, the learning rate is set to 0.01 in the warm-up phase (the first ten epochs) and then reduced to 0.00001 to make the model converge faster. The loss function is consistent with that of the original SSD detection network; training stops when the loss value essentially no longer changes.
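The schedule just described amounts to a two-step rule; a minimal sketch, assuming a standard PyTorch optimizer supplied by the surrounding training script:

```python
import torch

def adjust_lr(optimizer: torch.optim.Optimizer, epoch: int):
    """0.01 during the first ten warm-up epochs, then 0.00001 (as described above)."""
    lr = 0.01 if epoch < 10 else 0.00001
    for group in optimizer.param_groups:
        group["lr"] = lr
```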
Step 4.2. The trained detection network model was tested on the test set; compared with the existing RFB-Net, the detection network model provided by the invention greatly reduces the miss rate (on the same data set, from 50.9% for RFB-Net to 16.2%).
The described examples are intended to be illustrative, not limiting. Therefore, the present invention includes, but is not limited to, the examples described in the detailed description, and all other embodiments that can be derived from the technical solutions of the present invention by those skilled in the art also belong to the protection scope of the present invention.

Claims (7)

1. A real-time pedestrian detection method using background modeling enhanced data is characterized in that: comprises the following steps of (a) carrying out,
step 1, performing background modeling by using a background picture of monitoring data and a picture containing pedestrians, and performing binarization processing on the picture containing the pedestrians to generate a mask map;
step 2, inputting the mask map and the picture containing pedestrians, and generating a saliency map using a deep saliency detection network;
step 3, converting the saliency map into a pseudo color map;
step 4, establishing a detection network comprising two completely symmetrical sub-networks A and B and a final target detection layer; training the detection network by inputting the pictures containing pedestrians and the corresponding pseudo color maps into sub-networks A and B respectively, feeding the result maps obtained by sub-networks A and B into the target detection layer, and stopping training when the loss value tends to a constant; the sub-networks A and B comprise parallel RFB modules and BasicRFB_a modules, and the specific processing is as follows,
firstly, feature maps of the original color picture and the corresponding pseudo color picture are extracted by a pre-trained VGG16 model: the feature map conv4_2, obtained by two convolutions in layer 4 of VGG16, is sent to the BasicRFB_a module and its output is sent to the target detection layer; the feature map conv7_2, obtained by two convolutions in layer 7 of VGG16, is sent to the RFB module and its output is also sent to the target detection layer;
for the RFB module, a 1x1 convolution is first used to reduce the number of channels of the feature map, followed by three branches:
(1) a 3x3 dilated convolution with dilation rate 1;
(2) a 1x1 convolution, then a 3x3 dilated convolution with dilation rate 3;
(3) a 5x5 convolution, then a 3x3 dilated convolution with dilation rate 5;
the three branches are concatenated, the result is concatenated with the batch-normalized feature map conv7_2 extracted by VGG16, and the result is sent to a ReLU activation layer;
for the BasicRFB_a module, a 1x1 convolution is first used to reduce the number of channels of the feature map, followed by four branches:
(1a) a 3x3 dilated convolution with dilation rate 1;
(2a) a 1x3 convolution, then a 3x3 dilated convolution with dilation rate 3;
(3a) a 3x1 convolution, then a 3x3 dilated convolution with dilation rate 3;
(4a) a 3x3 convolution, then a 3x3 dilated convolution with dilation rate 5;
the four branches are concatenated, the result is concatenated with the batch-normalized feature map conv4_2 extracted by VGG16, and the result is sent to a ReLU activation layer;
and 5, detecting the pedestrian picture to be detected by using the trained detection network.
2. The method of real-time pedestrian detection using background modeling enhanced data according to claim 1, wherein: the specific implementation of step 1 is as follows,
step 1.1, preparing data: using a picture containing pedestrians and a surveillance background picture taken under the same illumination conditions but containing no pedestrians, and marking the parts containing pedestrians with rectangular frames, wherein the picture containing pedestrians and the background picture have the same size;
step 1.2, comparing the background picture without pedestrians and the target picture with pedestrians pixel by pixel in a rectangular frame area with pedestrians, calculating Euclidean distance, and preliminarily judging whether each pixel point belongs to the pedestrians or the background according to the Euclidean distance;
step 1.3, a two-dimensional Gaussian distribution density function is introduced to estimate the probability that the position of a certain pixel point is a pedestrian, and a loss function is constructed by combining the Euclidean distance in the step 1.2;
step 1.4, recording the proportion of the pedestrian pixel points in the rectangular frame as mu%, sequentially calculating the loss function value of each pixel point, sequencing in a descending order, taking the first mu% of the pixel points to judge as the foreground, and judging other pixel points as the background;
step 1.5, correcting pixel points: if all points surrounding a certain point are judged as foreground, the point does not belong to an edge area, and yet the point itself is judged as background, the point is corrected to foreground, finally yielding the foreground area and the background area;
and step 1.6, performing binarization processing: all pixel points determined as the foreground area are set to pure white and all pixel points of the background area to pure black, generating the mask map.
3. The method of real-time pedestrian detection using background modeling enhanced data according to claim 2, wherein: the specific implementation of step 1.2 is as follows,
comparing the RGB channel values of the pixel points at corresponding positions of the two pictures: let the R, G and B channel values of a pixel point P in the background picture and in the picture containing pedestrians be R1, G1, B1 and R2, G2, B2 respectively; the pixel points are divided into two categories, those belonging to the pedestrian part P_P and those belonging to the background part P_b;
if
sqrt((R1 − R2)^2 + (G1 − G2)^2 + (B1 − B2)^2) > t
then P ∈ P_P; otherwise P ∈ P_b,
where t is the discrimination threshold.
4. The method of real-time pedestrian detection using background modeling enhanced data according to claim 3, wherein: the specific implementation manner of introducing the two-dimensional gaussian distribution density function to estimate the probability that the position of a certain pixel point is a pedestrian in step 1.3 is as follows,
f(i, j) = 1 / (2π·σ_w·σ_h) · exp( −( (i − x)^2 / (2σ_w^2) + (j − y)^2 / (2σ_h^2) ) )
where i and j are the coordinates of the pixel point, x and y are the coordinates of the center point of the rectangular frame, and σ_w and σ_h are the standard deviations determined by the width and height of the rectangular frame.
5. The method of real-time pedestrian detection using background modeling enhanced data according to claim 4, wherein: step 1.3 the overall loss function is expressed as,
loss(i, j) = D(i, j) / D_max + α · f(i, j) · W_(i,j)
where the first term is the Euclidean distance between a pixel point and the corresponding background-picture pixel, normalized by D_max, the maximum such distance over all pixel points, and α is an empirical coefficient that determines the weight, i.e., the relative contributions of the pixel point's distance from the center and its difference from the background picture to the loss function; after the pictures of the Daimler data set are uniformly scaled to the size of the background picture, the ratio of the number of times each position is judged as pedestrian across the different pictures to the total number of pictures in the Daimler data set is counted, giving the frequency W_(i,j) with which each pixel point belongs to a pedestrian.
6. The method of real-time pedestrian detection using background modeling enhanced data according to claim 2, wherein: in step 1.5, a plurality of foreground regions exist under the same rectangular frame, and in order to avoid the situation that pedestrians are cracked, whether the plurality of foreground regions marked under the same rectangular frame need to be combined or not is judged before binary output processing is carried out, and the specific implementation mode is as follows,
performing pairwise traversal over the foreground regions in a rectangular frame to judge whether to merge them: for t_i, t_j ∈ T, T represents the set of foreground regions in the rectangular frame, rate is the average width-to-height ratio of pedestrian rectangular frames, x and y are the center coordinates of a region's circumscribed rectangle, t_k.width and t_k.height are the width and height of the circumscribed rectangle of the k-th foreground region, t_k.x and t_k.y are the center coordinates of the circumscribed rectangle of the k-th foreground region, W and H are the width and height of the rectangular frame, and D_r and D_r' are judgment thresholds; if t_i and t_j both satisfy
Figure FDA0002440177050000041
And is
Figure FDA0002440177050000042
Figure FDA0002440177050000043
Then two areas are merged, and the specific merging mode is as follows: and connecting a left lower point of the circumscribed rectangle of the upper partial region with a left upper point of the circumscribed rectangle of the lower partial region, connecting a right lower point of the circumscribed rectangle of the upper partial region with a right upper point of the circumscribed rectangle of the lower partial region, and completely filling the surrounded region into a foreground region.
7. The method of real-time pedestrian detection using background modeling enhanced data according to claim 1, wherein: the significance detection network in the step 2 adopts PICA-net.
CN202010263248.2A 2020-04-07 2020-04-07 Real-time pedestrian detection method using background modeling to enhance data Active CN111461036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010263248.2A CN111461036B (en) 2020-04-07 2020-04-07 Real-time pedestrian detection method using background modeling to enhance data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010263248.2A CN111461036B (en) 2020-04-07 2020-04-07 Real-time pedestrian detection method using background modeling to enhance data

Publications (2)

Publication Number Publication Date
CN111461036A true CN111461036A (en) 2020-07-28
CN111461036B CN111461036B (en) 2022-07-05

Family

ID=71685893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010263248.2A Active CN111461036B (en) 2020-04-07 2020-04-07 Real-time pedestrian detection method using background modeling to enhance data

Country Status (1)

Country Link
CN (1) CN111461036B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084874A (en) * 2020-08-11 2020-12-15 深圳市优必选科技股份有限公司 Object detection method and device and terminal equipment
CN112308114A (en) * 2020-09-24 2021-02-02 赣州好朋友科技有限公司 Method and device for sorting scheelite and readable storage medium
CN112785582A (en) * 2021-01-29 2021-05-11 北京百度网讯科技有限公司 Training method and device for thermodynamic diagram generation model, electronic equipment and storage medium
CN112907616A (en) * 2021-04-27 2021-06-04 浙江大学 Pedestrian detection method based on thermal imaging background filtering

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136511A (en) * 2013-01-21 2013-06-05 信帧电子技术(北京)有限公司 Behavior detection method and behavior detection device
CN103530879A (en) * 2013-10-15 2014-01-22 无锡清华信息科学与技术国家实验室物联网技术中心 Pedestrian color extraction method under specific scene
CN103700114A (en) * 2012-09-27 2014-04-02 中国航天科工集团第二研究院二O七所 Complex background modeling method based on variable Gaussian mixture number
KR101518485B1 (en) * 2013-11-29 2015-05-11 김홍기 Intelligent object tracking system
CN105139368A (en) * 2015-08-12 2015-12-09 旗瀚科技股份有限公司 Hybrid tone mapping method for machine vision
CN105550678A (en) * 2016-02-03 2016-05-04 武汉大学 Human body motion feature extraction method based on global remarkable edge area

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103700114A (en) * 2012-09-27 2014-04-02 中国航天科工集团第二研究院二O七所 Complex background modeling method based on variable Gaussian mixture number
CN103136511A (en) * 2013-01-21 2013-06-05 信帧电子技术(北京)有限公司 Behavior detection method and behavior detection device
CN103530879A (en) * 2013-10-15 2014-01-22 无锡清华信息科学与技术国家实验室物联网技术中心 Pedestrian color extraction method under specific scene
KR101518485B1 (en) * 2013-11-29 2015-05-11 김홍기 Intelligent object tracking system
CN105139368A (en) * 2015-08-12 2015-12-09 旗瀚科技股份有限公司 Hybrid tone mapping method for machine vision
CN105550678A (en) * 2016-02-03 2016-05-04 武汉大学 Human body motion feature extraction method based on global remarkable edge area

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SARTHAK GUPTA等: "GPOL: Gradient and Probabilistic approach for Object Localization to understand the working of CNNs", 《2019 IEEE BOMBAY SECTION SIGNATURE CONFERENCE (IBSSC)》 *
SONGTAO LIU等: "Receptive Field Block Net for Accurate and Fast Object Detection", 《15TH EUROPEAN CONFERENCE ON COMPUTER VISION (ECCV)》 *
WANG Weifeng et al.: "Fast Small-Target Detection Algorithm Based on Receptive Fields", Laser & Optoelectronics Progress *
LI Ning et al.: "Pedestrian Detection Combining Semantic Features under a Visual Attention Mechanism", Journal of Image and Graphics *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084874A (en) * 2020-08-11 2020-12-15 深圳市优必选科技股份有限公司 Object detection method and device and terminal equipment
CN112084874B (en) * 2020-08-11 2023-12-29 深圳市优必选科技股份有限公司 Object detection method and device and terminal equipment
CN112308114A (en) * 2020-09-24 2021-02-02 赣州好朋友科技有限公司 Method and device for sorting scheelite and readable storage medium
CN112785582A (en) * 2021-01-29 2021-05-11 北京百度网讯科技有限公司 Training method and device for thermodynamic diagram generation model, electronic equipment and storage medium
CN112785582B (en) * 2021-01-29 2024-03-22 北京百度网讯科技有限公司 Training method and device for thermodynamic diagram generation model, electronic equipment and storage medium
CN112907616A (en) * 2021-04-27 2021-06-04 浙江大学 Pedestrian detection method based on thermal imaging background filtering
CN112907616B (en) * 2021-04-27 2022-05-03 浙江大学 Pedestrian detection method based on thermal imaging background filtering

Also Published As

Publication number Publication date
CN111461036B (en) 2022-07-05

Similar Documents

Publication Publication Date Title
CN111461036B (en) Real-time pedestrian detection method using background modeling to enhance data
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN111612763B (en) Mobile phone screen defect detection method, device and system, computer equipment and medium
CN113160192B (en) Visual sense-based snow pressing vehicle appearance defect detection method and device under complex background
CN111640157B (en) Checkerboard corner detection method based on neural network and application thereof
CN105631880B (en) Lane line dividing method and device
CN110276264B (en) Crowd density estimation method based on foreground segmentation graph
CN112241699A (en) Object defect category identification method and device, computer equipment and storage medium
CN108197604A (en) Fast face positioning and tracing method based on embedded device
CN111008632B (en) License plate character segmentation method based on deep learning
CN106355607B (en) A kind of width baseline color image template matching method
CN111310756A (en) Damaged corn particle detection and classification method based on deep learning
CN112561899A (en) Electric power inspection image identification method
CN110866915A (en) Circular inkstone quality detection method based on metric learning
CN115272204A (en) Bearing surface scratch detection method based on machine vision
CN111127384A (en) Strong reflection workpiece vision measurement method based on polarization imaging
Liu et al. D-CenterNet: An anchor-free detector with knowledge distillation for industrial defect detection
CN114445661B (en) Embedded image recognition method based on edge calculation
US11588955B2 (en) Apparatus, method, and computer program for image conversion
CN109815957A (en) A kind of character recognition method based on color image under complex background
TW202319959A (en) Image recognition system and training method thereof
CN112304512A (en) Multi-workpiece scene air tightness detection method and system based on artificial intelligence
CN112949510A (en) Human detection method based on fast R-CNN thermal infrared image
CN111062384B (en) Vehicle window accurate positioning method based on deep learning
US20240005469A1 (en) Defect detection method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant