CN107657226B

CN107657226B - People number estimation method based on deep learning

Info

Publication number: CN107657226B
Application number: CN201710862828.1A
Authority: CN
Inventors: 解梅; 秦方; 李佩伦; 苏星霖
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2017-09-22
Filing date: 2017-09-22
Publication date: 2020-12-29
Anticipated expiration: 2037-09-22
Also published as: CN107657226A

Abstract

The invention discloses a people number estimation method based on deep learning and belongs to the field of crowd density estimation based on deep learning. The method adopts a single-column convolutional neural network built from convolutional layers and pooling layers. The network learns crowd features by training on a large number of samples, estimates the crowd density map of the input image, and integrates the density map to obtain an estimate of the number of people in the image. Compared with other current deep learning algorithms, the convolutional neural network adopted by the invention has a simple structure, low complexity, a short training time and higher estimation accuracy.

Description

People number estimation method based on deep learning

Technical Field

The invention belongs to the technical field of digital images, and particularly relates to crowd density estimation based on deep learning.

Background

With the rapid development of science and technology and the continuous improvement of economic levels, people's living demands keep rising. This has promoted the rapid development of artificial intelligence, which is gradually being applied in various fields, including intelligent driving, intelligent monitoring and security. Estimating the number of people from video images has important application value in intelligent monitoring and security: in large public places such as large event venues and railway stations, timely estimation of the number of people from images helps to evacuate overly dense crowds in time and to prevent safety accidents such as stampedes. In addition, the method can also be used for abnormality warning and the like.

Current people counting algorithms can be summarized in 3 categories:

(1) Methods based on target detection:

A detection model is established according to target features of pedestrians; the selected target features include the human head, the whole pedestrian, the combined head-and-shoulder contour and the like. A detector is trained on these features, targets are detected in combination with a sliding window method, and the number of detected targets is taken as the number of people. The detector is mainly in the form of features plus a classifier: the features are mainly HOG (histogram of oriented gradients), LBP (local binary pattern) and the like, and the classifier is mainly Adaboost, SVM and the like. The accuracy of detection-based methods depends heavily on the target detection method used; they are suitable only for scenes with a simple background, sparse crowds and little or no occlusion among pedestrians, and therefore have low practicality and limited applicability.

(2) Methods based on density map regression or people count regression:

These methods estimate the number of people in an image by building a regression model between image features and the number of people, or between image features and a crowd density map. Commonly used features include edge features, texture features and the like, and commonly used regression functions mainly include Gaussian regression, linear regression and the like. The methods are mainly used for surveillance video scenes, where foreground segmentation is used to extract the target area in a video image so that effective features can be extracted. However, these algorithms depend mainly on feature selection. Existing methods based on edge information, texture information, fusion of multiple kinds of feature information and the like have poor accuracy, and how to design effective features remains the main problem. The methods are also highly scene-dependent and transfer poorly between different scenes, i.e. they generalize poorly.

(3) Methods based on deep learning:

Deep learning currently shows remarkable superiority in many research fields of computer vision. Although deep learning has only recently been applied to people counting, it has significantly improved accuracy and generalization compared with traditional algorithms. Such methods use a deep convolutional neural network and train the network on a large number of labeled samples to learn crowd features, and thereby output the number of people in an image. However, most existing deep learning algorithms adopt a multi-column convolutional neural network and suffer from high complexity, a large sample requirement and a long training time.

Disclosure of Invention

The aim of the invention is: in view of the existing problems described above, to provide a people number estimation method based on deep learning with a single-column convolutional neural network.

The invention discloses a people number estimation method based on deep learning, which comprises the following steps:

constructing a deep neural network model: the model is a single-column convolutional neural network based on 10 convolutional layers and 2 pooling layers, wherein the convolution kernels of the first 6 convolutional layers are all of size 5x5, the convolution kernels of the 7th to 9th convolutional layers are all of size 3x3, and the convolution kernel of the last convolutional layer is of size 1x1; the 2 pooling layers adopt maximum pooling, and the size of each pooling kernel is 2x2;

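training the constructed deep neural network model by collecting training sample data to obtain a trained deep neural network model, wherein the loss function of the deep neural network model is

$$L(\Theta) = \frac{1}{2M} \sum_{n=1}^{M} \left\| F(X_n; \Theta) - F_n \right\|_2^2$$

wherein $F(X_n; \Theta)$ is the density map obtained by the network forward calculation for input image $X_n$ with network parameters $\Theta$, M is the number of training samples, and $F_n$ is the real density map of the input image $X_n$, calculated as

$$F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G(x)$$

wherein $\delta(x - x_i)$ is an impulse function at the position of a head in the image, $x_i$ represents the position of the i-th head, N is the total number of heads, G is a Gaussian kernel, and * denotes convolution;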

inputting the image to be estimated into the trained deep neural network model to obtain an estimated density map of the image to be estimated, and integrating the estimated density map to obtain the estimated number of people of the image to be estimated.

In summary, owing to the above technical scheme, the invention has the following beneficial effects: the method is based on a single-column convolutional neural network and constructs the loss function using the density map alone; the network structure is simple and effective, the estimation accuracy is improved, the network complexity and the model training time are reduced, and the overfitting risk of the network is reduced as well.
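The layer counts and kernel sizes stated above can be written down as a short layer list, as a minimal sketch. The kernel sizes (six 5x5, three 3x3, one 1x1 convolution; two 2x2 max-pooling layers) come from the text. The positions of the two pooling layers, the pooling stride of 2, and size-preserving ("same") padding for the convolutions are assumptions made only for illustration, since the text refers to FIG. 3-b for the detailed structure. The names `LAYERS`, `output_size` and `receptive_field` are hypothetical.

```python
# Layer list for a single-column network: 10 conv layers and 2 max-pooling layers.
# Kernel sizes follow the text. ASSUMPTIONS (not stated in the text): where the
# pooling layers sit, pooling stride 2, and 'same' padding for every convolution.
LAYERS = (
    [("conv", 5)] * 2 + [("pool", 2)]
    + [("conv", 5)] * 2 + [("pool", 2)]
    + [("conv", 5)] * 2
    + [("conv", 3)] * 3
    + [("conv", 1)]
)


def output_size(height, width, layers=LAYERS):
    """Spatial size of the output density map for a height x width input."""
    for kind, k in layers:
        if kind == "pool":
            # 2x2 max pooling with stride 2 halves each dimension (floor division)
            height, width = height // k, width // k
        # stride-1 convolutions with 'same' padding keep the spatial size
    return height, width


def receptive_field(layers=LAYERS):
    """Side length, in input pixels, of the region seen by one output pixel."""
    rf, jump = 1, 1
    for kind, k in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k - 1) * current stride
        if kind == "pool":
            jump *= k  # pooling stride equals the pooling size (assumed)
    return rf


if __name__ == "__main__":
    print(output_size(768, 1024))
    print(receptive_field())
```

Under these assumptions the two pooling layers reduce each spatial dimension by a factor of 4, so the label density map would have to be prepared at one quarter of the input resolution; a different pooling placement changes the receptive field but not this factor.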

Drawings

FIG. 1: Flow diagram of people number estimation based on deep learning.

FIG. 2: Structure diagram of the people number estimation convolutional neural network.

FIG. 3: Comparison of the network structure of the existing people number estimation network MCNN (Multi-Column Convolutional Neural Network) with that of the neural network Crowd-CNN of the invention, wherein 3-a is the existing MCNN network structure and 3-b is the Crowd-CNN network structure of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.

The invention discloses a single-column convolutional neural network based on 10 convolutional layers and 2 pooling layers, named Crowd-CNN for short, which simplifies existing deep learning network structures and realizes estimation of the number of people in an image. Referring to FIG. 1, the specific implementation steps of the invention are as follows:

Step 1, constructing and training the deep neural network:

Step 1-1, preparing training data: for the Crowd-CNN network structure, this embodiment adopts the databases UCSD, Shanghaitech PartA and Shanghaitech PartB, which are commonly used in the field of people counting. The annotation information (ground truth) of each sample is the head position information (x, y) in the image sample, i.e. the coordinates of the center pixel of each head in the image. A density map is then calculated from the head coordinates as the label information of the network, and a tool under the Caffe framework is used to generate LMDB data files (comprising training and test sample data) from the sample images and the label information.

Calculating a density map: a density map of each sample is calculated with a Gaussian kernel from the head position information in the training image sample. The density map based on a geometry-adaptive Gaussian kernel is calculated as:

$$F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G(x) \tag{1}$$

wherein $\delta(x - x_i)$ is an impulse function at the position of a head in the image, $x_i$ is the head position vector, i.e. the head position information (x, y), N is the total number of heads, G is a Gaussian kernel, and * denotes convolution.
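The density map computation described above can be sketched in NumPy as follows. This is a simplified illustration, not the procedure of the embodiment: it uses one fixed Gaussian spread for every head, whereas the text refers to a geometry-adaptive kernel whose parameters are not given here. The function names, the default `sigma=4.0` and the kernel side length `size=15` are assumptions chosen for the example.

```python
import numpy as np


def gaussian_kernel(size, sigma):
    """Normalized 2-D Gaussian kernel of odd side length `size` (sums to 1)."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()


def density_map(shape, heads, sigma=4.0, size=15):
    """Build a density map by placing one unit-mass Gaussian per head.

    shape: (height, width) of the image.
    heads: iterable of (x, y) head-center pixel coordinates.
    sigma, size: ASSUMED fixed kernel parameters (the text's kernel is adaptive).
    """
    h, w = shape
    r = size // 2
    kernel = gaussian_kernel(size, sigma)
    dmap = np.zeros((h, w), dtype=np.float64)
    for x, y in heads:
        x, y = int(round(x)), int(round(y))
        if not (0 <= x < w and 0 <= y < h):
            continue  # skip annotations that fall outside the image
        # Part of the kernel window that lies inside the image
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        x0, x1 = max(0, x - r), min(w, x + r + 1)
        patch = kernel[y0 - (y - r): y1 - (y - r), x0 - (x - r): x1 - (x - r)]
        # Renormalize the clipped window so that every head still contributes 1
        dmap[y0:y1, x0:x1] += patch / patch.sum()
    return dmap


if __name__ == "__main__":
    d = density_map((48, 64), [(10, 10), (0, 0), (63, 47)])
    print(d.shape, d.sum())
```

Because each head contributes a kernel of total mass 1, the sum of the map equals the number of annotated heads inside the image, which is what makes integrating the density map a valid way to count.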

Step 1-2, constructing a network: the overall structure of the deep learning network of the invention is shown in FIG. 2, and the detailed structure is shown in FIG. 3-b. It has 10 convolutional layers and 2 pooling layers, the pooling layers adopt maximum pooling, and the loss function adopts a Euclidean distance loss function. The Euclidean distance loss function (Euclidean Loss) is calculated as:

$$L(\Theta) = \frac{1}{2M} \sum_{n=1}^{M} \left\| F(X_n; \Theta) - F_n \right\|_2^2 \tag{2}$$

wherein $F(X_n; \Theta)$ is the density map obtained by the network forward calculation, $F_n$ is the real density map F(x) of the input image $X_n$ calculated by formula (1), i.e. the label information input to the network, and M is the number of training samples.
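The Euclidean loss between estimated and real density maps can be sketched in NumPy as below. The normalization by 2M (half the mean over the M samples of the squared L2 distance) is the usual form of a Euclidean loss and is an assumption here; the function name `euclidean_loss` is hypothetical.

```python
import numpy as np


def euclidean_loss(predicted, target):
    """Euclidean loss over a set of density maps.

    predicted, target: arrays of shape (M, H, W), one density map per sample.
    Returns sum over samples of ||predicted_n - target_n||^2, divided by 2M
    (the 1/(2M) normalization is an ASSUMPTION).
    """
    predicted = np.asarray(predicted, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if predicted.shape != target.shape:
        raise ValueError("predicted and target must have the same shape")
    m = predicted.shape[0]
    diff = predicted - target
    return float(np.sum(diff ** 2) / (2.0 * m))


if __name__ == "__main__":
    pred = np.zeros((2, 2, 2))
    real = np.ones((2, 2, 2))
    print(euclidean_loss(pred, real))
```

The loss compares the maps pixel by pixel, so the network is supervised on where the people are as well as on how many there are.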

Step 1-3, training the network: using the Caffe framework, the training data and test data (LMDB files) generated in step 1-1 and the network file constructed in step 1-2 are loaded into the Caffe training process. The network error is calculated through the forward calculation of the network and the loss function of formula (2); the error is back-propagated, the error gradient with respect to the weights of each layer is calculated, and the weights are updated so that the network error gradually decreases. This process is executed in a loop to search for the most effective network parameters until the network loss is reduced to a minimum or to a value that meets the requirement, at which point training is complete and the network model is obtained. The process can be summarized simply as parameter optimization.
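The forward, loss, backward and update cycle described above can be illustrated with a deliberately tiny stand-in model. This is not the Caffe training procedure of the embodiment: the "network" here is a single scalar weight `w` that scales an input map, chosen only so that the gradient can be written by hand. The function name `train_toy`, the learning rate and the iteration count are assumptions for the example.

```python
import numpy as np


def train_toy(samples, lr=0.01, iterations=200, w=0.0):
    """Gradient descent with one sample per step on a one-parameter toy model.

    samples: list of (input_map, real_density_map) pairs of equal shape.
    The toy model predicts w * input_map (an ASSUMED stand-in for the CNN).
    Returns the learned weight and the loss recorded at every iteration.
    """
    losses = []
    for it in range(iterations):
        x, target = samples[it % len(samples)]  # one sample per step
        pred = w * x                            # forward calculation
        diff = pred - target
        losses.append(0.5 * float(np.sum(diff ** 2)))  # Euclidean loss, M = 1
        grad = float(np.sum(diff * x))          # dL/dw, the back-propagated error
        w -= lr * grad                          # weight update
    return w, losses


if __name__ == "__main__":
    x = np.ones((4, 4))
    w, losses = train_toy([(x, 0.5 * x)])
    print(w, losses[0], losses[-1])
```

In a real network the hand-written gradient line is replaced by back-propagation through every layer, but the loop structure stays the same: forward pass, loss, gradient, weight update, repeated until the loss is small enough.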

Step 2, testing:

The image to be tested is fed into the network structure constructed in step 1, and the network model parameters trained in step 1 are loaded for forward calculation to obtain the estimated density map of the image.

Integrating the density map, i.e. summing its values over all pixels, gives the estimated number of people.
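On a discrete image, integrating the density map reduces to summing its pixel values, as in this minimal NumPy sketch (the function name `estimate_count` is hypothetical):

```python
import numpy as np


def estimate_count(density_map):
    """Discrete integral of a density map: the sum of all pixel values."""
    return float(np.sum(np.asarray(density_map, dtype=np.float64)))


if __name__ == "__main__":
    # A 4x4 map in which every pixel carries a quarter of a person
    print(estimate_count(np.full((4, 4), 0.25)))
```

The result is generally not an integer; whether and how to round it is left open here.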

In the test experiments, the invention adopts two evaluation metrics commonly used in the field of people counting, the mean absolute error (MAE) and the mean square error (MSE), which measure the accuracy and the stability of the algorithm, respectively.

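Mean Absolute Error (MAE) definition:

$$\mathrm{MAE} = \frac{1}{M} \sum_{i=1}^{M} \left| Z_i - \hat{Z}_i \right|$$

Mean Square Error (MSE) definition:

$$\mathrm{MSE} = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \left( Z_i - \hat{Z}_i \right)^2}$$

wherein M is the number of test samples, $Z_i$ is the actual number of people in test sample i, and $\hat{Z}_i$ is the number of people in test sample i calculated by the network.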
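The two metrics can be sketched in NumPy as follows. The MSE is written here as the square root of the mean squared count error. That form is an assumption, since the formula images are missing from this text, but it matches the reported MSE values being of the same order of magnitude as the MAE values. The function names and the sample counts are hypothetical.

```python
import numpy as np


def mae(true_counts, est_counts):
    """Mean absolute error between actual and estimated people counts."""
    z = np.asarray(true_counts, dtype=np.float64)
    z_hat = np.asarray(est_counts, dtype=np.float64)
    return float(np.mean(np.abs(z - z_hat)))


def mse(true_counts, est_counts):
    """'MSE' as the square root of the mean squared count error (ASSUMED form)."""
    z = np.asarray(true_counts, dtype=np.float64)
    z_hat = np.asarray(est_counts, dtype=np.float64)
    return float(np.sqrt(np.mean((z - z_hat) ** 2)))


if __name__ == "__main__":
    actual = [10, 20, 30]      # hypothetical ground-truth counts
    estimated = [12, 18, 33]   # hypothetical network estimates
    print(mae(actual, estimated), mse(actual, estimated))
```

MAE averages the size of the count errors, while the squared form weights large errors more heavily, which is why it is read as a measure of stability.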

The simple-structure network provided by the invention is compared with the better-performing existing MCNN network (network structure shown in FIG. 3-a) through experimental tests on the commonly used people counting databases UCSD, Shanghaitech PartA and Shanghaitech PartB. The network structure adopted by the invention is simple and greatly reduces the training time while maintaining accuracy. The results of the experimental comparison are shown in Tables 1, 2 and 3.

TABLE 1 network training iteration number comparison

TABLE 2 MCNN network test results

Database	MSE	MAE
Shanghaitech PartA	173.2	110.2
Shanghaitech PartB	41.3	26.4
UCSD	1.35	1.07

TABLE 3 Crowd-CNN network test results

Database	MSE	MAE
Shanghaitech PartA	170.38	109.05
Shanghaitech PartB	42.1	26.04
UCSD	1.21	1.03

The comparison shows that people number estimation based on the Crowd-CNN network structure achieves high accuracy. Compared with the MCNN network structure, the method has a simpler network structure with far fewer network parameters and a much shorter training time, so the requirement on the amount of training data is greatly reduced and the risk of network overfitting is reduced. At the same time, the error is reduced on most measures in Tables 2 and 3; the exception is the MSE on Shanghaitech PartB (42.1 versus 41.3).

While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims

1. A method for estimating the number of people based on deep learning is characterized by comprising the following steps:

constructing a deep neural network model:

the model is a single-column convolutional neural network based on 10 convolutional layers and 2 pooling layers, wherein the convolution kernels of the first 6 convolutional layers are all of size 5x5, the convolution kernels of the 7th to 9th convolutional layers are all of size 3x3, and the convolution kernel of the last convolutional layer is of size 1x1; the 2 pooling layers adopt maximum pooling, and the size of each pooling kernel is 2x2;

preparing training data:

adopting the databases UCSD, Shanghaitech PartA and Shanghaitech PartB commonly used in the people counting field, wherein the annotation information of each sample is the head position information (x, y) in the image sample, i.e. the coordinates of the center pixel of each head in the image; then calculating a density map from the head coordinates as the label information of the network, and generating, with a tool under the Caffe framework, LMDB data files comprising training data and test data from the sample images and the label information;

calculating a density map: calculating a density map of the sample based on a Gaussian kernel according to the head position information in the training image sample; the density map based on a geometry-adaptive Gaussian kernel is calculated as:

$$F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G(x)$$

wherein $\delta(x - x_i)$ is an impulse function at the position of a head in the image, $x_i$ is the head position vector, i.e. the head position information (x, y), N is the total number of heads, G is a Gaussian kernel, and * denotes convolution;

training the constructed deep neural network model based on training sample data to obtain a trained deep neural network model:

loading the generated training data and test data and the constructed network file of the deep neural network model into the Caffe training process by using the Caffe framework, calculating the network error through the forward calculation of the network and the loss function L(Θ), back-propagating the error, calculating the error gradient with respect to the weights of each layer of the network, and updating the weights so that the network error value gradually decreases; executing this process in a loop and searching for the most effective network parameters to reduce the network loss to a minimum or to a value that meets the requirement; wherein the loss function of the deep neural network model is

$$L(\Theta) = \frac{1}{2M} \sum_{n=1}^{M} \left\| F(X_n; \Theta) - F_n \right\|_2^2$$

wherein $F(X_n; \Theta)$ is the density map obtained by the network forward calculation, M is the number of training samples, and $F_n$ is the real density map F(x) of the input image $X_n$ calculated according to the above formula, i.e. the label information input to the network;

during training, the number of samples selected for one training step (the batch size) is set to 1, the learning rate base_lr is set to 1e-7, and the number of training iterations of the trained deep neural network model is 800,000;

inputting an image to be estimated into a trained deep neural network model to obtain an estimated density map of the image to be estimated, and integrating the estimated density map to obtain the estimated number of people of the image to be estimated;

testing the trained deep neural network model in a people counting database UCSD, Shanghaitech PartA and Shanghaitech PartB, wherein the average absolute error and mean square error corresponding to each people counting database are specifically as follows:

the average absolute error and the mean square error of the UCSD are respectively as follows: 1.03, 1.21;

the average absolute error and the mean square error of the people counting database Shanghaitech PartA are respectively as follows: 109.05, 170.38;

the average absolute error and the mean square error of the people counting database Shanghaitech PartB are 26.04 and 42.1 respectively.