CN108804992B - Crowd counting method based on deep learning - Google Patents

Crowd counting method based on deep learning

Info

Publication number
CN108804992B
CN108804992B · CN201710318219.XA
Authority
CN
China
Prior art keywords
human body
image
deep learning
new
pixel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710318219.XA
Other languages
Chinese (zh)
Other versions
CN108804992A (en)
Inventor
雷航 (Lei Hang)
杨铮 (Yang Zheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201710318219.XA
Publication of CN108804992A
Application granted
Publication of CN108804992B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/215Motion-based segmentation


Abstract

The method first extracts the motion foreground from the video, then uses a human body region model to guarantee invariance to the camera's viewing angle and perspective, and finally counts the crowd from the human body regions obtained through preprocessing, extraction, and detection. The method reduces the sliding-window search area and improves search efficiency, overcomes deformation in the surveillance video caused by viewing angle, distance to the monitored scene, and the like, and keeps system installation and deployment simple. A detection model based on a deep-learning convolutional neural network improves human detection accuracy, and non-maximum suppression eliminates redundant sub-regions and reduces repeated counting, so that the results of detecting, localizing, and counting people are more accurate.

Description

Crowd counting method based on deep learning
Technical Field
The invention belongs to the field of intelligent video surveillance, and in particular relates to a crowd counting method based on deep learning.
Background
With the popularization of video surveillance systems, cameras now cover every corner of our cities. First, given the enormous number of cameras and the volume of surveillance video, manually judging the behavior and attributes of people in a monitored scene is impractical. Second, under complicated conditions such as rain, snow, night scenes, or ultra-dense crowds, it is difficult even to identify individual people by eye, let alone count them.
At present, crowd counting methods used in video surveillance systems fall into three main categories: the first slides a detector over the image and judges and counts human bodies one by one; the second extracts crowd motion-trajectory features from the images and clusters them, the clustering result giving the crowd count; the third uses statistical methods to estimate the crowd distribution, computes the crowd density, and derives the count from it. However, these methods all rely on hand-crafted features and cope poorly with complex scenes. Some do not address invariance to perspective and observation angle, so they cannot handle object deformation caused by viewing angle and perspective and are ill-suited to wide-field scenes; others do address it, but their accuracy depends heavily on parameters such as the camera's shooting angle and observation distance being measured manually by the user, which complicates system installation and configuration. Detector-based processing is limited by the quality of the detector, and sliding a window over the entire image is computationally enormous, making real-time operation hard to guarantee.
Disclosure of Invention
The invention provides a method that minimizes the sliding-window search area to improve detection efficiency and reduce the interference of complex scenes with human detection, while achieving perspective and observation-angle invariance with only simple configuration.
In order to achieve the above purpose, the invention provides a crowd counting method based on deep learning, which comprises the following steps:
step 1, performing white balance preprocessing on the input image using a gray world algorithm;
step 2, extracting the motion foreground from the preprocessed image using a background segmentation method based on the K-nearest-neighbor algorithm;
step 3, traversing the extracted image pixels with a method that guarantees viewing-angle and perspective invariance, and feeding each pixel coordinate (x, y) into a trained linear model to obtain the human body region size;
step 4, using a convolutional neural network as the human body detection model;
step 5, counting the final number of human bodies.
Further, the gray world algorithm in step 1 performs white balance preprocessing on the image through the following steps:
1) averaging each of the three channels of the input image;
2) computing the gain of each channel and applying the gain to the original image;
3) normalizing the result;
the formula is as follows:
Figure BDA0001289054130000021
Figure BDA0001289054130000022
Figure BDA0001289054130000023
Figure BDA0001289054130000024
Figure BDA0001289054130000025
I out =(R new ,G new ,B new )
wherein M is R 、M G 、M B Representing the mean of the three channels of the input image R, G, B, respectively, alpha representing the global mean of the three channels, K representing the gain value of each channel, R new 、G new 、B new Representing the three channels behind the superimposed gain, I out Representing the image after gain superposition; for the above process, there may be overflow (>255, no less than 0) appears, experiments show that if it is directly going to do so>Setting 255 pixels to 255 may cause the image to be entirely whitish, so calculating all R is used new 、G new 、B new And then using the maximum value to linearly map the calculated data back to [0,255 []And (4) the following steps. The image is subjected to white balance preprocessing to automatically equalize the gray values of the pixels.
Furthermore, the extraction in step 2 uses a background segmentation method based on the K-nearest-neighbor algorithm: it traverses every pixel of the input image, finds the K pixel points closest to that pixel within a certain neighborhood, takes a majority vote over the categories of those points, and assigns the winning category to the current pixel. The classification decision rule is:

y = argmax_{c_j} Σ_{x_i ∈ N_K(x)} I(y_i = c_j)

where N_K(x) is the set of the K nearest neighbors of x and I(·) is the indicator function, i.e. it evaluates to 1 when y_i = c_j and to 0 otherwise.
Further, dilation and erosion operations are performed on the foreground image extracted in step 2.
Further, in step 3, every pixel of the foreground region is traversed with a method that guarantees viewing-angle and perspective invariance; each pixel coordinate (x, y) is taken as the center of a sub-region and fed into the trained linear model to obtain the human body region size. The region size is computed as:

w = θ₀ + θ₁·x + θ₂·y
h = ω₀ + ω₁·x + ω₂·y
J(θ) = (1/2m) Σ_{i=1..m} (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
θ_j := θ_j − (α/m) Σ_{i=1..m} (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x_j⁽ⁱ⁾

where w and h denote the width and height of the body region at coordinates (x, y), and θ and ω denote the weights of the linear models for the region width and height respectively. The learnable weights θᵢ and ωᵢ are obtained by manually cropping human body regions from the detection scene and training with a linear regression algorithm; the last two formulas are the regression objective and its gradient-descent weight update, in which h_θ denotes the linear estimate, y the ground-truth value, and α the learning rate.
Further, in step 4, all the computed human body region sub-images cropped from the original image are input to the convolutional neural network to judge whether each is a human body. The body size computation is based on the linear-regression human body region model. All regions judged to contain a human body are sorted by the network output value, i.e. the confidence of the judgment; taking the region with the highest confidence as the reference, every region whose overlap exceeds a set threshold is removed. The formulas are:

o = S_over / S
f(o) = 1 if o ≤ σ, 0 if o > σ

where S_over denotes the area of the overlapping portion of the two regions being compared, and S denotes the sum of the areas of the two regions. Regions with f(o) = 0 are removed; the remaining regions form the final result.
Furthermore, the convolutional neural network of step 4 serves as the human body detection model; its network structure follows the cifar10 network in the Caffe deep learning framework, with the parameters of each layer simplified.
Because the invention adopts the above technical scheme, it has the following beneficial effects:
the method comprises the steps of extracting a motion prospect from a video, ensuring the visual angle and the perspective invariance of a camera by using a human body region model, and finally determining the statistical population of the human body region through preprocessing, extracting and detecting. The method can reduce the search area of the sliding window and improve the search efficiency, overcomes the deformation of the monitoring video caused by the visual angle, the distance from the monitoring scene and the like, is simple in system installation and deployment, improves the human body detection accuracy rate based on the detection model of the deep learning convolutional neural network, eliminates redundant sub-areas by using a non-maximum inhibition method and reduces repeated counting, so that the results of detecting the human body, positioning the human body and counting the number of people are more accurate.
Drawings
FIG. 1 is a flow chart of a people counting method;
FIG. 2 is a flowchart of a human body region model training process;
FIG. 3 is a human detection model training flow diagram;
fig. 4 is a diagram of a human body detection convolutional neural network structure.
Detailed Description
The invention is further described below with reference to the following figures and examples.
As shown in figs. 1-3, a crowd counting method based on deep learning includes the following steps:
step 1, performing white balance preprocessing on the input image using a gray world algorithm;
furthermore, the white balance preprocessing method of the gray scale world algorithm firstly averages the three channels of the preprocessed image, then obtains the gain of each channel and superposes the gain value on the original image, and finally plans the result. The image subjected to white balance processing can automatically balance the gray value of the pixels, prevent the whole image from being slightly bright or dark, and remove the interference of illumination to a certain extent.
The formulas are as follows:

M_R = (1/(m·n)) Σ_{i,j} R(i, j),  M_G = (1/(m·n)) Σ_{i,j} G(i, j),  M_B = (1/(m·n)) Σ_{i,j} B(i, j)
α = (M_R + M_G + M_B) / 3
K_R = α / M_R,  K_G = α / M_G,  K_B = α / M_B
R_new = K_R · R,  G_new = K_G · G,  B_new = K_B · B
I_out = (R_new, G_new, B_new)

where M_R, M_G, M_B denote the means of the R, G, B channels of the input image, α the global mean of the three channels, K the gain of each channel, R_new, G_new, B_new the three channels after the gain is applied, and I_out the image after gain superposition. Overflow can occur in this process (values above 255; values below 0 cannot arise); experiments show that directly setting pixels above 255 to 255 can make the whole image whitish, so instead the maximum of all R_new, G_new, B_new values is computed and used to linearly map the results back into [0, 255].
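The white-balance procedure above can be sketched in a few lines of NumPy (an illustrative sketch; the patent gives no implementation, and the mapping of overflow back into [0, 255] follows the strategy described in the text):

```python
import numpy as np

def gray_world_white_balance(img):
    """Gray-world white balance: equalize the per-channel means, then map
    any overflow back into [0, 255] using the global maximum (instead of
    clipping, which would whiten the image).  img: H x W x 3, R/G/B, uint8."""
    img = img.astype(np.float64)
    means = img.reshape(-1, 3).mean(axis=0)   # M_R, M_G, M_B
    alpha = means.mean()                      # global mean of the three channels
    gains = alpha / means                     # per-channel gains K_R, K_G, K_B
    out = img * gains                         # apply (superimpose) the gains
    peak = out.max()
    if peak > 255:                            # overflow: linearly map back to [0, 255]
        out *= 255.0 / peak
    return out.astype(np.uint8)
```

On a uniformly tinted image the three channel means collapse to the same gray value, which is exactly the equalization the text describes.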
Step 2, extracting the motion foreground from the preprocessed image using a background segmentation method based on the K-nearest-neighbor algorithm;
further, the video motion foreground extraction technology based on the K Nearest Neighbor (KNN) algorithm traverses each pixel of the input image, finds K pixel points closest to the pixel in a certain neighborhood, performs majority voting on the categories of the points, determines the category of the current pixel, and updates the background. Dividing each framing of the video into a background or a foreground; the classification decision rule is as follows:
Figure BDA0001289054130000046
wherein I (-) is an indicator function, i.e. when y i =c j The time function goes to 1, otherwise 0.
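The decision rule can be illustrated with a toy sketch of the k-nearest-neighbor vote (a simplification for illustration: a production system would normally use OpenCV's `cv2.createBackgroundSubtractorKNN`, which also maintains and updates the per-pixel background model; the sample data in the usage below are hypothetical):

```python
import numpy as np

def knn_classify(sample, neighbors, labels, k=5):
    """Majority vote over the k nearest neighbors, i.e. return the class c_j
    that maximizes the sum of I(y_i = c_j) over the k closest points.
    sample: 1-D feature vector; neighbors: N x d array of stored samples;
    labels: N array of classes (0 = background, 1 = foreground)."""
    dists = np.linalg.norm(neighbors - sample, axis=1)
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = np.bincount(labels[nearest], minlength=2)
    return int(np.argmax(votes))                  # winning category
```

For example, a pixel value close to the stored background samples is voted background, and one close to the stored foreground samples is voted foreground.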
Further, dilation and erosion operations are performed on the extracted motion foreground to eliminate noise and obtain the final foreground region.
Dilation merges all background points in contact with an object into the object, expanding the boundary outward, using for example a 3×3 structuring element as the dilation template. Each pixel of the scanned image is ORed with the binary image covered by the structuring element: if all covered values are 0 the output pixel is 0, otherwise it is 1. The result is that the binary region grows by one ring.
Erosion eliminates boundary points and shrinks the boundary inward; it can be used to remove small, meaningless objects, again using for example a 3×3 template. Each pixel of the scanned image is ANDed with the binary image covered by the structuring element: if all covered values are 1 the output pixel is 1, otherwise it is 0. The result is that the binary region shrinks by one ring.
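The two 3×3 template operations described above can be sketched directly on a binary mask (illustrative NumPy only; in practice `cv2.dilate` and `cv2.erode` would be used):

```python
import numpy as np

def dilate3x3(mask):
    """3x3 dilation: a pixel becomes 1 if any pixel under the template is 1
    (the OR described above), growing the binary region by one ring."""
    p = np.pad(mask, 1)
    h, w = mask.shape
    out = np.zeros_like(mask)
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            out |= p[dy:dy + h, dx:dx + w]
    return out

def erode3x3(mask):
    """3x3 erosion: a pixel stays 1 only if every pixel under the template
    is 1 (the AND described above), shrinking the region by one ring."""
    p = np.pad(mask, 1)
    h, w = mask.shape
    out = np.ones_like(mask)
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            out &= p[dy:dy + h, dx:dx + w]
    return out
```

A single foreground pixel dilates into a 3×3 block, and eroding that block recovers the single pixel, which is the noise-removal behavior used on the extracted foreground.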
Step 3, traversing the extracted image pixels with a method that guarantees viewing-angle and perspective invariance, and feeding each pixel coordinate (x, y) into the trained linear model to obtain the human body region size;
furthermore, each pixel of the foreground area is traversed by a sliding window method, then each traversed pixel point coordinate (x, y) is input into a human body area model to obtain the size of the human body area, and the relationship between the pixel space coordinate of the fixed scene image and the size of the human body area is established by adopting a linear regression model and ensuring the invariance of the visual angle and the perspective. Before training, the human body areas at various positions are manually intercepted from the scene, and all coordinates from far to near are covered as far as possible. And then training the model by using linear regression to obtain a human body region model. The formula is as follows:
Figure BDA0001289054130000051
Figure BDA0001289054130000052
where equation (1) is the objective function, h θ (x) A linear estimation function for the target problem is represented, and y represents a real value of the target problem; equation (2) is a weight update function, θ represents the weight of the linear model, and α represents the learning rate.
Sub-images are then cropped from I_out as candidate images to be detected, taking each traversed pixel coordinate (x, y) as the center of a sub-region whose size is given by the region model. The region size is computed as:

w = θ₀ + θ₁·x + θ₂·y
h = ω₀ + ω₁·x + ω₂·y

where w and h denote the width and height of the body region at coordinates (x, y), and θ and ω denote the weights of the linear models for the region width and height respectively; the learnable weights θᵢ and ωᵢ are obtained by manually cropping body regions from the detection scene and training with linear regression, each fitted with the objective and update rule of equations (1) and (2). Because the perspective relationship between distance and object size is linear, the body size computation rests on a linear-regression region model. Using machine learning to guarantee viewing-angle and perspective invariance means the system can be recalibrated simply, whether at installation and deployment or after the camera's shooting angle or distance changes.
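The region model can be sketched as follows. The patent fits θ and ω by gradient descent (equations (1) and (2)); this sketch solves the same least-squares problem in closed form with `numpy.linalg.lstsq`, and the training samples in the usage below are synthetic stand-ins for manually cropped regions:

```python
import numpy as np

def fit_region_model(coords, sizes):
    """Fit w = θ0 + θ1·x + θ2·y and h = ω0 + ω1·x + ω2·y.
    coords: N x 2 array of region centers (x, y);
    sizes:  N x 2 array of measured (w, h) for the cropped regions."""
    X = np.hstack([np.ones((len(coords), 1)), coords])  # design matrix [1, x, y]
    theta = np.linalg.lstsq(X, sizes[:, 0], rcond=None)[0]
    omega = np.linalg.lstsq(X, sizes[:, 1], rcond=None)[0]
    return theta, omega

def region_size(theta, omega, x, y):
    """Width and height of the body region centered at pixel (x, y)."""
    w = theta[0] + theta[1] * x + theta[2] * y
    h = omega[0] + omega[1] * x + omega[2] * y
    return w, h
```

Because the model is linear in (x, y), recalibrating after the camera moves only requires cropping a new handful of body regions and refitting.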
Step 4, adopting a convolutional neural network as a human body detection model;
the convolutional neural network is used as a human body detection model, and the network structure of the convolutional neural network refers to a cifar10 network in a cafe deep learning framework, so that parameters of each layer of the network are simplified. In the training process of the convolutional neural network model based on deep learning, firstly, human body samples are collected from a large number of monitoring videos, namely a human body database, a human body sample database with 1600 positive samples and 1600 negative samples is finally obtained, and the human body sample database is used as a training sample network to obtain a human body detection model.
The convolutional neural network shown in fig. 4 contains two convolutional layers, two max pooling layers, two local response normalization layers, two fully connected layers, and a softmax classifier. The convolutional layers extract features; the pooling layers compress the input feature maps, reducing their scale to obtain more generalizable features; the local normalization layers, somewhat like activation layers, normalize the input features to speed up training; the fully connected layers summarize the input features and map them into a high-order space for classification; the softmax layer classifies the feature vectors. The model's input is a 24×24 image with 3 channels. The first convolutional layer Conv1 and activation layer ReLU1 produce a 16-channel 24×24 feature map; Max Pooling1 downsamples it to 16 channels of 12×12; a local normalization layer leaves the size unchanged; Conv2, ReLU2, another local normalization layer, and Max Pooling2 then yield a 16-channel 6×6 feature map; finally, two fully connected layers and the softmax classifier convert it to a 2-dimensional vector. A 3-channel 24×24 input image is thus classified by the network into one of 2 classes: human or non-human.
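As a sanity check on the layer sizes quoted above, the feature-map shapes can be traced in a few lines (the text does not state kernel sizes or padding, so a size-preserving convolution is assumed, which matches the stated 24×24 map after Conv1):

```python
def detector_shapes():
    """Trace feature-map sizes through the detection network described above."""
    h, w, c = 24, 24, 3        # input: 3-channel 24x24 image
    c = 16                     # Conv1 + ReLU1 -> 16 channels, 24x24 ("same" conv)
    h, w = h // 2, w // 2      # Max Pooling1  -> 16 channels, 12x12
    c = 16                     # LRN1 (size unchanged), Conv2 + ReLU2 -> 16 ch, 12x12
    h, w = h // 2, w // 2      # LRN2 (size unchanged), Max Pooling2 -> 16 ch, 6x6
    flat = h * w * c           # features entering the two fully connected layers
    return (h, w, c), flat     # the FC layers and softmax then map to the 2 classes
```

The trace ends at a 16-channel 6×6 map, i.e. 576 features feeding the fully connected stage, consistent with the architecture described in the text.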
Step 5, counting the final number of human bodies.
Further, a non-maximum suppression algorithm is applied to all sub-regions judged in step 4 to contain human bodies, removing redundant regions. All such regions are sorted by the network output value, i.e. the confidence of the judgment; taking the region with the highest confidence as the reference, every region whose overlap exceeds a set threshold is removed. The formulas are:

o = S_over / S
f(o) = 1 if o ≤ σ, 0 if o > σ

where S_over denotes the area of the overlapping portion of the two regions being compared, S denotes the sum of the areas of the two regions, and o denotes the fraction of the whole that overlaps. Regions with f(o) = 0 are removed, and the remaining regions form the final result. Experiments show the method performs best when σ = 0.2.
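The suppression step can be sketched as follows (note that, as the text defines it, the denominator S is the summed area of the two regions, not the IoU union used by most modern detectors; σ = 0.2 as reported best, and the boxes in the usage below are hypothetical):

```python
def non_max_suppression(boxes, scores, sigma=0.2):
    """Keep the highest-confidence region, remove every region whose overlap
    ratio o = S_over / S with it exceeds sigma (f(o) = 0), and repeat.
    boxes: list of (x1, y1, x2, y2); scores: confidences; returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)                   # highest remaining confidence
        keep.append(i)
        survivors = []
        for j in order:
            ox = max(0, min(boxes[i][2], boxes[j][2]) - max(boxes[i][0], boxes[j][0]))
            oy = max(0, min(boxes[i][3], boxes[j][3]) - max(boxes[i][1], boxes[j][1]))
            s_over = ox * oy               # area of the overlapping portion
            s = ((boxes[i][2] - boxes[i][0]) * (boxes[i][3] - boxes[i][1])
                 + (boxes[j][2] - boxes[j][0]) * (boxes[j][3] - boxes[j][1]))
            if s_over / s <= sigma:        # f(o) = 1: region survives
                survivors.append(j)
        order = survivors
    return keep
```

Two heavily overlapping detections collapse to the single higher-confidence one, while a distant detection is kept, which is how repeated counting of the same person is avoided.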

Claims (8)

1. A crowd counting method based on deep learning, characterized by comprising the following steps:
step 1, performing white balance preprocessing on the input image using a gray world algorithm;
step 2, extracting the motion foreground from the preprocessed image using a background segmentation method based on the K-nearest-neighbor algorithm;
step 3, traversing each pixel of the foreground region with a method that guarantees viewing-angle and perspective invariance, taking each pixel coordinate (x, y) as the center of a sub-region and feeding it into the trained linear model to obtain the human body region size, computed as:

w = θ₀ + θ₁·x + θ₂·y
h = ω₀ + ω₁·x + ω₂·y
J(θ) = (1/2m) Σ_{i=1..m} (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
θ_j := θ_j − (α/m) Σ_{i=1..m} (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x_j⁽ⁱ⁾

where w and h denote the width and height of the body region at coordinates (x, y), and θ and ω denote the weights of the linear models for the region width and height respectively; the learnable weights θᵢ and ωᵢ are obtained by manually cropping human body regions from the detection scene and training with a linear regression algorithm, the last two formulas being the regression objective and its gradient-descent update, in which h_θ denotes the linear estimate, y the ground-truth value, and α the learning rate;
step 4, using a convolutional neural network as the human body detection model, and inputting all the computed human body region sub-images cropped from the original image into the convolutional neural network to judge whether each is a human body;
step 5, counting the final number of human bodies.
2. The deep learning based crowd counting method of claim 1, wherein the gray world algorithm of step 1 performs white balance preprocessing on the image through the following steps:
1) averaging each of the three channels of the input image;
2) computing the gain of each channel and applying the gain to the original image;
3) normalizing the result;
the formulas being:

M_R = (1/(m·n)) Σ_{i,j} R(i, j),  M_G = (1/(m·n)) Σ_{i,j} G(i, j),  M_B = (1/(m·n)) Σ_{i,j} B(i, j)
α = (M_R + M_G + M_B) / 3
K_R = α / M_R,  K_G = α / M_G,  K_B = α / M_B
R_new = K_R · R,  G_new = K_G · G,  B_new = K_B · B
I_out = (R_new, G_new, B_new)

where M_R, M_G, M_B denote the means of the R, G, B channels of the input image, α the global mean of the three channels, K the gain of each channel, R_new, G_new, B_new the three channels after the gain is applied, and I_out the image after gain superposition; overflow above 255 can occur in this process (values below 0 cannot arise), and since directly setting pixels above 255 to 255 can make the whole image whitish, the maximum of all R_new, G_new, B_new values is instead used to linearly map the results back into [0, 255].
3. The deep learning based crowd counting method of claim 1 or 2, wherein the gray values of the pixels of the image white-balance-preprocessed in step 1 are automatically equalized.
4. The deep learning based crowd counting method of claim 1, wherein the extraction in step 2 uses a background segmentation method based on the K-nearest-neighbor algorithm: every pixel of the input image is traversed, the K pixel points closest to that pixel within a certain neighborhood are found, a majority vote is taken over the categories of those points, and the winning category is assigned to the current pixel; the classification decision rule is:

y = argmax_{c_j} Σ_{x_i ∈ N_K(x)} I(y_i = c_j)

where N_K(x) is the set of the K nearest neighbors of x and I(·) is the indicator function, evaluating to 1 when y_i = c_j and to 0 otherwise.
5. The deep learning based crowd counting method of claim 1 or 4, wherein dilation and erosion operations are performed on the foreground image extracted in step 2.
6. The deep learning based crowd counting method of claim 1, wherein all regions judged to contain a human body are sorted by the network output value, i.e. the confidence of the judgment, and, taking the region with the highest confidence as the reference, every region whose overlap exceeds a set threshold is removed; the formulas are:

o = S_over / S
f(o) = 1 if o ≤ σ, 0 if o > σ

where S_over denotes the area of the overlapping portion of the two regions being compared and S denotes the sum of the areas of the two regions; regions with f(o) = 0 are removed, and the remaining regions form the final result.
7. The deep learning based crowd counting method of claim 1, wherein step 4 uses the convolutional neural network as the human body detection model, its network structure following the cifar10 network in the Caffe deep learning framework with the parameters of each layer simplified.
8. The deep learning based crowd counting method of claim 1, wherein the human body size calculation is based on a linear-regression human body region model.
CN201710318219.XA 2017-05-08 2017-05-08 Crowd counting method based on deep learning Active CN108804992B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710318219.XA CN108804992B (en) 2017-05-08 2017-05-08 Crowd counting method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710318219.XA CN108804992B (en) 2017-05-08 2017-05-08 Crowd counting method based on deep learning

Publications (2)

Publication Number Publication Date
CN108804992A CN108804992A (en) 2018-11-13
CN108804992B true CN108804992B (en) 2022-08-26

Family

ID=64094203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710318219.XA Active CN108804992B (en) 2017-05-08 2017-05-08 Crowd counting method based on deep learning

Country Status (1)

Country Link
CN (1) CN108804992B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276779A (en) * 2019-06-04 2019-09-24 East China Normal University A dense crowd image generation method based on foreground-background segmentation
CN110562810B (en) * 2019-08-01 2020-10-23 Gree Electric Appliances, Inc. of Zhuhai Elevator dispatching method, device, computer equipment and storage medium
CN111882555B (en) * 2020-08-07 2024-03-12 China Agricultural University Deep learning-based netting detection method, device, equipment and storage medium
CN112580616B (en) * 2021-02-26 2021-06-18 Tencent Technology (Shenzhen) Co., Ltd. Crowd quantity determination method, device, equipment and storage medium
CN114926467B (en) * 2022-07-22 2022-10-21 Xinhenghui Electronics Co., Ltd. (新恒汇电子股份有限公司) Full-automatic lead frame counting method and device
CN116385969B (en) * 2023-04-07 2024-03-12 Jinan University Personnel gathering detection system based on multi-camera cooperation and human feedback
CN116805337B (en) * 2023-08-25 2023-10-27 Tianjin Normal University Crowd positioning method based on trans-scale visual transformation network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101211411A (en) * 2007-12-21 2008-07-02 Vimicro Corporation Human body detection method and device
CN102521582A (en) * 2011-12-28 2012-06-27 Zhejiang University Human upper body detection and segmentation method applied to low-contrast video
CN103051905A (en) * 2011-10-12 2013-04-17 Apple Inc. Use of noise-optimized selection criteria to calculate scene white points
CN103049788A (en) * 2012-12-24 2013-04-17 Nanjing University of Aeronautics and Astronautics Computer-vision-based system and method for detecting the number of pedestrians waiting to cross a crosswalk
CN103226701A (en) * 2013-04-24 2013-07-31 Tianjin University Modeling method for video semantic events
CN105160313A (en) * 2014-09-15 2015-12-16 Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences Method and apparatus for crowd behavior analysis in video monitoring
CN106027787A (en) * 2016-06-15 2016-10-12 Vivo Mobile Communication Co., Ltd. White balance method of mobile terminal, and mobile terminal

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160315682A1 (en) * 2015-04-24 2016-10-27 The Royal Institution For The Advancement Of Learning / Mcgill University Methods and systems for wireless crowd counting


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Crowd counting algorithm for surveillance video based on convolutional neural networks" (基于卷积神经网络的监控视频人数统计算法); Ma Haijun (马海军); Journal of Anhui University (安徽大学学报); 2016-05-30; Vol. 40, No. 3; pp. 22-28 *

Also Published As

Publication number Publication date
CN108804992A (en) 2018-11-13

Similar Documents

Publication Publication Date Title
CN108804992B (en) Crowd counting method based on deep learning
US9846946B2 (en) Objection recognition in a 3D scene
CN106683119B (en) Moving vehicle detection method based on aerial video image
CN109598794B (en) Construction method of three-dimensional GIS dynamic model
CN108304808A (en) A kind of monitor video method for checking object based on space time information Yu depth network
CN104978567B (en) Vehicle checking method based on scene classification
CN106845364B (en) Rapid automatic target detection method
CN107330390B (en) People counting method based on image analysis and deep learning
CN106780560B (en) Bionic robot fish visual tracking method based on feature fusion particle filtering
Rout A survey on object detection and tracking algorithms
CN104517095B (en) A kind of number of people dividing method based on depth image
CN114266977B (en) Multi-AUV underwater target identification method based on super-resolution selectable network
CN101923637B (en) A kind of mobile terminal and method for detecting human face thereof and device
CN108734200B (en) Human target visual detection method and device based on BING (building information network) features
CN109410248B (en) Flotation froth motion characteristic extraction method based on r-K algorithm
Su et al. A new local-main-gradient-orientation HOG and contour differences based algorithm for object classification
CN112733711A (en) Remote sensing image damaged building extraction method based on multi-scale scene change detection
Ali et al. Vehicle detection and tracking in UAV imagery via YOLOv3 and Kalman filter
CN106056078A (en) Crowd density estimation method based on multi-feature regression ensemble learning
Ghahremannezhad et al. Automatic road detection in traffic videos
CN109215059B (en) Local data association method for tracking moving vehicle in aerial video
CN110636248B (en) Target tracking method and device
Angelo A novel approach on object detection and tracking using adaptive background subtraction method
KR101690050B1 (en) Intelligent video security system
CN111476314B (en) Fuzzy video detection method integrating optical flow algorithm and deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant