CN111611878A - Method for crowd counting and future people flow prediction based on video image - Google Patents


Info

Publication number
CN111611878A
Authority
CN
China
Prior art keywords
crowd
density
convlstm
density map
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010364590.1A
Other languages
Chinese (zh)
Other versions
CN111611878B (en)
Inventor
李小玉
翁立
赖晓平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202010364590.1A priority Critical patent/CN111611878B/en
Publication of CN111611878A publication Critical patent/CN111611878A/en
Application granted granted Critical
Publication of CN111611878B publication Critical patent/CN111611878B/en
Legal status: Active

Links

Images

Classifications

    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53: Recognition of crowd images, e.g. recognition of crowd congestion
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045: Combinations of networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a method for crowd counting and future people flow prediction based on video images. The invention comprises the following steps: 1. selecting a video image data set with annotation information and applying Gaussian-function processing according to the annotated head positions to generate a ground-truth density map; 2. inputting the video frames into the constructed MPDC model to extract feature maps and mapping the feature maps to crowd estimation density maps (DE); 3. stacking the obtained DE frames, inputting them into the constructed Bi-ConvLSTM network, predicting the crowd prediction density map at time T+1, and estimating the number of pedestrians at time T+1. The method adopts a multi-scale pyramid dilated convolution network and a Bi-ConvLSTM network based on residual connections, uses consecutive video frames to generate crowd estimation density maps, further predicts the crowd prediction density maps of future frames, and counts the number of people. The invention targets prediction over consecutive video images and is a novel method: it not only obtains a real-time crowd density map and head count, but also predicts the crowd density map and people flow of future frames.

Description

Method for crowd counting and future people flow prediction based on video image
Technical Field
The invention belongs to the field of crowd image processing in computer vision, and particularly relates to a method for crowd counting and future people flow prediction based on video images.
Background
Crowd counting means counting the number of people in a picture or a video sequence. Crowd counting and forecasting are important for public safety management, regional space planning, information resource acquisition and the like: they make it possible to better monitor and channel crowds in public places, and provide a basis for reasonable scheduling of personnel, route planning, crowd flow guidance, and building site selection.
Existing crowd counting methods can be divided into three categories: methods based on detection, on regression, and on density map estimation. Detection-based methods suit scenes with large, sparse targets. Regression of the head count, however, cannot estimate the congestion distribution and loses the spatial information of the targets' positions.
With the development of crowd counting technology, crowd counting algorithms have evolved from simply computing the number of pedestrians to producing an estimated crowd density map, so that the pedestrians can be counted while the density distribution of the crowd is obtained at the same time. Compared with the former two categories, methods based on deep-learning density map regression can, to a certain extent, effectively handle occlusion within the crowd, and the density map reveals the specific density distribution of pedestrians.
However, in complex backgrounds and in scenes where high-density crowds gather, interference factors such as target occlusion caused by heavy overlap between people, perspective distortion, scale variation, and uneven density distribution create great resistance to counting people and acquiring density information. Existing deep-learning methods for generating density maps use multi-column convolutional layers and large convolution kernels to extract multi-scale image features, which produces a large number of parameters and increases the difficulty of training the network. In addition, people flow prediction algorithms based on crowd video images have not been studied.
Disclosure of Invention
The invention aims to provide a method for crowd counting and future people flow prediction based on video images, addressing the problems in the existing crowd counting field.
The technical scheme of the invention generally comprises the following steps:
Step 1, selecting a video image data set with manual annotation information, and applying Gaussian-function processing according to the annotated head positions in the images to generate a ground-truth density map corresponding to each original image;
Step 2, first building a multi-scale pyramid dilated convolution network model (MPDC). Consecutive video frames of the video image data set are input into the MPDC model, feature maps with multi-scale information are fully extracted and mapped to a crowd estimation density map (DE), and the number of people is counted by integrating over the crowd estimation density map.
The multi-scale pyramid dilated convolution network model is divided into two parts: the first part is a VGG-Basic network; the second part consists of four parallel dilated convolution layers with different dilation rates; the feature map output by each branch is concatenated along the channel dimension with the feature map output by the VGG-Basic network.
Step 3, constructing a Bi-ConvLSTM network based on residual connections. Several frames of the obtained crowd estimation density maps (DE) are stacked and input into the Bi-ConvLSTM network to predict the crowd prediction density map at time T+1 and estimate the number of pedestrians at time T+1.
The bidirectional ConvLSTM module is an improvement on the conventional ConvLSTM: the input crowd estimation density maps (DE) are processed by two ConvLSTM units running in the forward and backward directions and superimposed, so the output contains both forward and backward sequence information.
Step 4, pre-training the multi-scale pyramid dilated convolution network model (MPDC) of step 2, saving the model parameters, storing the obtained crowd estimation density maps (DE), and inputting them into the residual-connected Bi-ConvLSTM network of step 3 for training. A stochastic gradient descent algorithm is adopted to optimize the parameters of the MPDC model and the Bi-ConvLSTM network, and the Euclidean distance is used to compute the loss between the crowd prediction density map and the ground-truth density map.
Preferably, the specific steps of step 1 are as follows:
and (3) converting the head position label in the input video image data set into a truth-value density graph by utilizing a two-dimensional Gaussian convolution kernel, and using the truth-value density graph as a training set with truth values for calculating loss difference.
In order to better correspond the truth density map to the dense crowd images at different viewing angles, the truth density map based on the geometric adaptive gaussian kernel is selected and represented by the following formula:
Figure BDA0002476139910000031
the truth density map is obtained by convolving the delta pulse function with a Gaussian function, and summing after convolution. x is the number ofiRepresenting the position of the ith individual's head in the image, i.e. the pixel coordinates of the ith individual's head in the image, (x-x)i) Pulse function representing the position of the head in an image, N1The total number of the human heads in the image;
Figure BDA0002476139910000032
Figure BDA0002476139910000033
is a distance x from the head positioniThe average distance of the nearest m head positions proves that β is 0.3 to be the best effect.
Figure BDA0002476139910000034
Representing the position x of the human headiTo the head position xjThe distance of (c).
The processing of step 1 converts the original image with head annotation in the video image data set into a true density map.
Preferably, the specific steps of step 2 are:
and generating an estimated crowd density map by continuous video images in the video image data set through a multi-scale pyramid cavity convolution network model.
The multi-scale pyramid cavity convolution network model is divided into two parts:
the first part is VGG-Basic, which takes a VGG-16 network as a Basic framework, only the first 10 convolutional layers and the first 3 maximum pooling layers are reserved, and the rest layers are completely removed;
the second part is composed of four parallel cavity convolution layers with different cavity rates, and four characteristic graphs of different receptive field multi-scale information are respectively generated; wherein the voidage is sequentially set to r-2lAnd l is 1,2,3 and 4, each hole convolutional layer has 5 convolutional layers, the size of a convolutional kernel is set to be C3, then the feature maps output by the four branches and the feature map output by the VGG-Basic are spliced on a channel, 1 × 1 convolutional layer is adopted to carry out feature dimension reduction and is mapped into a crowd estimation density map (DE), and the number of people is integrated and counted on the crowd estimation density map (DE).
Preferably, the specific steps of step 3 are:
for the density map prediction of consecutive video images, a Bi-ConvLSTM network based on residual concatenation is proposed. And (3) inputting the crowd estimated density map (DE) obtained in the step (2) into a Bi-ConvLSTM network based on residual connection for reconstruction and prediction, and inputting the crowd estimated density map sequence at the continuous time of { T-T., T-1, T } into the Bi-ConvLSTM network based on residual connection.
The Bi-ConvLSTM network based on residual connection uses ConvLSTM as a basic structure, the ConvLSTM is replaced by a bidirectional ConvLSTM structure, an input crowd estimation density graph is calculated by forward and reverse superposition of two ConvLSTM units, and an output characteristic graph comprises forward sequence information and reverse sequence information and is used for reconstructing a crowd estimation density graph sequence and predicting a future video frame sequence.
The spatio-temporal sequence prediction problem is to predict the most likely K video frame sequences in the future from the previous J training video frames,
Figure BDA0002476139910000041
wherein the previous frame
Figure BDA0002476139910000042
Most likely future frame Fm={Xt+1,Xt+2...,Xt+KPredicting future frames
Figure BDA0002476139910000043
t denotes the current time, J denotes the number of previous frames, K denotes the number of predicted frames, and σ () is a softmax function.
The Bi-ConvLSTM network is composed of bidirectional ConvLSTM, BN, ReLU activation functions and residual connection structures. And finally, obtaining a crowd prediction density map at the T +1 moment through convolution and ReLU function activation, and performing integral statistics on the pedestrian volume at the T +1 moment.
Preferably, the specific content of step 4 is:
training process: pre-training the multi-scale pyramid cavity convolution network Model (MPDC) in the step 2, storing model parameters, storing the obtained crowd estimation density map (DE) as the input of the step 3, and inputting the crowd estimation density map into a Bi-ConvLSTM network based on residual connection for training.
The loss between the crowd predicted density map and the truth density map is calculated by using Euclidean distance, and the invention adopts a random gradient descent algorithm to optimize parameters until the loss value converges to the predicted value.
The euclidean distance is used to measure the difference between the predicted and true density maps of the population. When the distance between the crowd prediction density graph and the truth value density graph is generated by adopting Euclidean distance measurement, the loss function is defined as follows:
Figure BDA0002476139910000051
wherein N represents the number of pictures input into the multi-scale pyramid hole convolution network model, and Z (X)i(ii) a Theta) is a crowd density estimation graph corresponding to the ith input picture, XiWhich represents the i-th input picture,
Figure BDA0002476139910000052
is shown asThe truth density map of i input picture pairs. Θ represents the network parameters to be learned.
When the crowd predicted density map is evaluated, the commonly used Mean Square Error (MSE) and Mean Absolute Error (MAE) are adopted, the MSE is used for describing the accuracy of the crowd predicted density map, the accuracy is higher when the MSE is smaller, and the MAE can reflect the error condition of the crowd predicted density map.
Figure BDA0002476139910000053
Figure BDA0002476139910000054
N represents the number of pictures input into the multi-scale pyramid hole convolution network model, CiThe predicted number of people in the crowd predicted density graph corresponding to the ith input picture is shown,
Figure BDA0002476139910000055
and the real number of people corresponding to the ith input picture is represented.
Testing process: a new data set of consecutive video frames is selected and input into the trained model for testing; the crowd prediction density maps are output and the counting results are computed.
The invention has the following beneficial effects:
the method adopts a Bi-ConvLSTM network based on multi-scale pyramid hole convolution network and residual connection, uses continuous video frames to generate an estimated crowd estimation density map (DE), predicts a crowd density map (FP) of a future frame, and further predicts the flow of people. The invention aims at predicting the density map and the pedestrian volume of the crowd in the video image target, and is a brand new method. The method comprises the steps that a cavity convolution network is selected from a multi-scale pyramid cavity convolution network to replace the traditional convolution-pooling-upsampling process, the receptive field is expanded while the precision is not lost, the target is accurately positioned, four groups of parallel cavity convolutions are adopted to form a pyramid mode, and the characteristics of an image are fully extracted by utilizing the receptive fields with different sizes; by fusing the outputs of different convolution layers, the learned features have more complete representation on the image; in a Bi-ConvLSTM network based on residual connection, a density map of a future frame is predicted by utilizing the strong space-time feature extraction capability of the ConvLSTM network, the ConvLSTM is replaced by the bidirectional ConvLSTM, the input density map is calculated by the forward and reverse superposition of two ConvLSTM units, and the output contains forward sequence information and reverse sequence information. Compared with the existing crowd counting technology, the method provided by the invention is used for counting the crowd of the video image, so that not only can a real-time crowd density map and the number of people be obtained, but also the crowd density map and the flow of people of a future frame can be predicted.
Drawings
FIG. 1 is the overall flow chart of the network of the present invention;
FIG. 2 is a network structure for generating a density map based on a multi-scale pyramid hole convolution network;
FIG. 3 is a network structure of a residual concatenation based Bi-ConvLSTM predicted future frame density map;
FIG. 4 is a network structure of a bidirectional ConvLSTM module according to the present invention;
FIG. 5 is a flowchart of the network model training process of the present invention.
Detailed Description
The invention provides a method for crowd counting and people flow prediction based on video images. 1) A VGG-Basic structure is selected for preliminary feature extraction; it is a convolutional neural network (CNN) built from several layers of small convolution kernels, which gives strong image feature representation while keeping the network's training parameters few. 2) Dilated convolutions are selected to replace the traditional convolution-pooling-upsampling pipeline, enlarging the receptive field and locating targets accurately without losing precision; four groups of parallel dilated convolution layers form a pyramid, and different receptive fields acquire the multi-scale information of the image. 3) By fusing the outputs of different convolutional layers, the learned features represent the image more completely. 4) The video-frame density map at future times is predicted by a bidirectional convolutional long short-term memory network (Bi-ConvLSTM) based on residual connections: t density maps are stacked as input according to the step length, the strong spatio-temporal feature extraction capability of the ConvLSTM network is used to predict the density map of the future frame, and the crowd count at the future time is predicted. Compared with existing crowd counting technology, the method counts crowds in video images and not only obtains a real-time crowd density map and head count, but also predicts the crowd density map and people flow of future frames.
As shown in FIG. 1: step 1, a video image data set with manual annotation information is selected, and Gaussian-function processing is applied according to the annotated head positions in the images to generate a ground-truth density map corresponding to each original image. Step 2, a multi-scale pyramid dilated convolution network model (MPDC) is built first; consecutive video frames of the video image data set are input into the MPDC model, feature maps with multi-scale information are fully extracted and mapped to a crowd estimation density map (DE), and the number of people is counted by integrating over the crowd estimation density map (DE). Step 3, a bidirectional ConvLSTM network based on residual connections (Bi-ConvLSTM network) is constructed; several frames of the obtained crowd estimation density maps (DE) are stacked and input into the Bi-ConvLSTM network to predict the crowd prediction density map at time T+1 and estimate the number of pedestrians at time T+1. Step 4, the MPDC model of step 2 is pre-trained, the model parameters are saved, and the obtained crowd estimation density maps (DE) are stored and input into the residual-connected Bi-ConvLSTM network of step 3 for training.
The method comprises the following specific steps:
the general steps of the step 1 are as follows:
and (3) converting the head position label in the input video image data set into a truth-value density graph by utilizing a two-dimensional Gaussian convolution kernel, and using the truth-value density graph as a training set with truth values for calculating loss difference. In order to make the truth density map better correspond to the dense crowd images at different viewing angles, the truth density map based on the geometric adaptive gaussian kernel is selected and represented by the following formula:
Figure BDA0002476139910000071
the truth density map is obtained by convolving the delta pulse function with a Gaussian function, and summing after convolution. x is the number ofiRepresenting the position of the ith individual's head in the image, i.e. the pixel coordinates of the ith individual's head in the image, (x-x)i) Pulse function representing the position of the head in an image, N1The total number of the human heads in the image;
Figure BDA0002476139910000081
Figure BDA0002476139910000082
is a distance x from the head positioniThe average distance of the nearest m head positions proves that β is 0.3 to be the best effect.
Figure BDA0002476139910000083
Representing the position x of the human headiTo the head position xjThe distance of (c).
The above operations convert the original image with the head label into a true density map, and train as a contrast training set of the convolutional neural network.
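For illustration, the following is a minimal NumPy/SciPy sketch of the geometry-adaptive kernel described above. It assumes annotations are (x, y) pixel coordinates; the function name, the default m = 3, and the fallback σ for a single head are illustrative choices, not prescribed by the patent.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import KDTree

def geometry_adaptive_density_map(shape, head_points, beta=0.3, m=3):
    """Build a ground-truth density map from head annotations.

    shape       -- (H, W) of the image
    head_points -- list of (x, y) pixel coordinates, one per head
    beta, m     -- kernel scale factor (0.3 per the text) and neighbour count
    """
    density = np.zeros(shape, dtype=np.float32)
    if len(head_points) == 0:
        return density
    tree = KDTree(head_points)
    # The nearest neighbour of each point is itself, so query m + 1 neighbours.
    distances, _ = tree.query(head_points, k=min(m + 1, len(head_points)))
    for i, (x, y) in enumerate(head_points):
        col = min(int(x), shape[1] - 1)
        row = min(int(y), shape[0] - 1)
        impulse = np.zeros(shape, dtype=np.float32)
        impulse[row, col] = 1.0  # delta impulse at the head position
        if len(head_points) > 1:
            sigma = beta * distances[i][1:].mean()  # beta * mean distance to m nearest heads
        else:
            sigma = np.mean(shape) / 4.0            # assumed fallback for a single head
        # Convolve the impulse with a Gaussian of width sigma and accumulate;
        # the sum over the final map approximates the head count N.
        density += gaussian_filter(impulse, sigma)
    return density
```

Filtering one impulse image per head is slow but keeps the correspondence with the formula explicit; a production version would stamp local Gaussian windows instead.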
The detailed steps of step 2 are as follows:
As shown in FIG. 2, consecutive video images in the video image data set pass through the multi-scale pyramid dilated convolution network model (MPDC) to generate estimated crowd density maps. The MPDC model is divided into two parts. The first part is VGG-Basic, which takes the VGG-16 network as its backbone and keeps only the first 10 convolutional layers and the first 3 max-pooling layers, removing all remaining layers; the convolution kernels are all 3×3, the channel numbers are set in sequence to 64, 128, 256, 512 and 512, and the pooling size is 2×2. The second part consists of four groups of parallel dilated convolutions with different dilation rates, producing four feature maps that carry multi-scale information from different receptive fields; the dilation rates are set in sequence to r = 2^l (l = 1, 2, 3, 4), each dilated branch contains 5 convolutional layers with kernel size 3×3 and channel numbers set in sequence to 512, 256 and 128. The four groups of output feature maps are then concatenated along the channel dimension with the feature map output by VGG-Basic, and a 1×1 convolutional layer performs feature dimension reduction and maps the result to a crowd estimation density map (DE), over which the real-time head count is obtained by integration.
The detailed steps of step 3 are as follows:
As shown in FIG. 3, a Bi-ConvLSTM network based on residual connections is proposed for density map prediction over consecutive video images. The crowd estimation density maps (DE) obtained in step 2 are input into the residual-connected Bi-ConvLSTM network for reconstruction and prediction: the sequence of crowd estimation density maps at the consecutive times {T−t, ..., T−1, T} is input into the network. The network uses ConvLSTM as its basic structure, replacing ConvLSTM with a bidirectional ConvLSTM structure: the input crowd estimation density maps are processed by two ConvLSTM units running in the forward and backward directions and superimposed, so the output contains both forward and backward sequence information; this is used to reconstruct the density map sequence and to predict future video frames. For example, in data processing, 9 frames are input; the 5th frame is predicted forward from frames 1-4 and backward from frames 9-6, and the two predictions are combined to obtain the final predicted 5th frame. The spatio-temporal sequence prediction problem is to predict the most likely K future video frames from the previous J training frames:

\tilde{F}_m = \arg\max_{F_m} p(F_m \mid F_p)

where the previous frames are F_p = {X_{t-J+1}, ..., X_t} and the most likely future frames are F_m = {X_{t+1}, X_{t+2}, ..., X_{t+K}}; t denotes the current time, J the number of previous frames, K the number of predicted frames, and σ(·) is the softmax function.
The Bi-ConvLSTM network is composed of bidirectional ConvLSTM layers, BN, ReLU activation functions, and residual connection structures. Finally, convolution followed by ReLU activation yields the crowd prediction density map (FP) at time T+1, over which the people flow at time T+1 is obtained by integration.
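A minimal PyTorch sketch of the bidirectional ConvLSTM computation follows. The gate equations are the standard ConvLSTM ones; superimposing the two directions by elementwise addition, the hidden width, and the omission of the BN/ReLU/residual wrapper and the final conv head are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One ConvLSTM cell: the four LSTM gates computed by a convolution."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class BiConvLSTM(nn.Module):
    """Runs one ConvLSTM forward and one backward over a density-map
    sequence and superimposes the two outputs."""
    def __init__(self, in_ch=1, hid_ch=32):
        super().__init__()
        self.fwd = ConvLSTMCell(in_ch, hid_ch)
        self.bwd = ConvLSTMCell(in_ch, hid_ch)

    def run(self, cell, seq):
        b, _, _, hgt, wid = seq.shape          # seq: (B, T, C, H, W)
        h = seq.new_zeros(b, cell.hid_ch, hgt, wid)
        c = torch.zeros_like(h)
        out = []
        for t in range(seq.size(1)):
            h, c = cell(seq[:, t], (h, c))
            out.append(h)
        return torch.stack(out, dim=1)

    def forward(self, seq):
        fwd = self.run(self.fwd, seq)
        bwd = self.run(self.bwd, seq.flip(1)).flip(1)  # backward pass, re-aligned
        return fwd + bwd                               # superimpose both directions
```

In the full network described in the text, such modules would be stacked with BN, ReLU, and residual connections, with a final convolution mapping the hidden features back to a 1-channel density map.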
The detailed content of step 4 is as follows:
As shown in FIG. 5, the training process: the multi-scale pyramid dilated convolution network model (MPDC) of step 2 is pre-trained and its parameters are saved; the resulting crowd estimation density maps (DE) are stored as the input of step 3 and fed into the residual-connected Bi-ConvLSTM network for training. The loss between the crowd prediction density map (FP) and the ground-truth density map (GT) is computed with the Euclidean distance, and the invention adopts a stochastic gradient descent algorithm to optimize the parameters until the loss value converges.
The Euclidean distance measures the difference between the crowd prediction density map and the ground-truth density map. With the Euclidean distance measuring the distance between the crowd prediction density map and the ground-truth density map, the loss function is defined as:

L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \| Z(X_i; \Theta) - Z_i^{GT} \|_2^2

where N is the number of pictures input to the multi-scale pyramid dilated convolution network model, X_i is the i-th input picture, Z(X_i; \Theta) is the crowd density estimation map for the i-th input picture, Z_i^{GT} is the ground-truth density map for the i-th input picture, and \Theta denotes the network parameters to be learned.
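A hedged sketch of this training step in PyTorch: `mpdc`, `bi_convlstm`, and `loader` are hypothetical stand-ins for the two networks and the data pipeline, the learning rate is illustrative, and the Bi-ConvLSTM is assumed to emit one density map per time step, the last of which is taken as the T+1 prediction.

```python
import torch
import torch.nn as nn

def train_step(mpdc, bi_convlstm, loader, lr=1e-6):
    params = list(mpdc.parameters()) + list(bi_convlstm.parameters())
    optimizer = torch.optim.SGD(params, lr=lr)        # stochastic gradient descent
    euclidean = nn.MSELoss(reduction="sum")           # summed squared L2 distance
    for frames, gt_density in loader:                 # frames: (B, T, 3, H, W)
        # Estimated density map for each frame, stacked along time
        de = torch.stack([mpdc(f) for f in frames.unbind(1)], dim=1)
        fp = bi_convlstm(de)[:, -1]                   # predicted map at time T+1
        loss = 0.5 * euclidean(fp, gt_density) / frames.size(0)  # L(Theta) above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```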
The crowd prediction density maps are evaluated with the commonly used mean squared error (MSE) and mean absolute error (MAE); the MSE describes the accuracy of the crowd prediction density map (the smaller the MSE, the higher the accuracy), while the MAE reflects its error:

MAE = \frac{1}{N} \sum_{i=1}^{N} | C_i - C_i^{GT} |

MSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} ( C_i - C_i^{GT} )^2 }

where N is the number of pictures input to the multi-scale pyramid dilated convolution network model, C_i is the predicted count from the crowd prediction density map for the i-th input picture, and C_i^{GT} is the true count for the i-th input picture.
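Both metrics operate on per-picture counts (integrals of the density maps); a short sketch, using the square-root form of the MSE given above:

```python
import numpy as np

def mae_mse(pred_counts, true_counts):
    """MAE and MSE over predicted vs. true crowd counts per picture."""
    pred = np.asarray(pred_counts, dtype=np.float64)
    true = np.asarray(true_counts, dtype=np.float64)
    mae = np.abs(pred - true).mean()
    mse = np.sqrt(((pred - true) ** 2).mean())  # root form, as in the formula above
    return mae, mse
```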
Testing process: a new data set of consecutive video frames is selected and input into the trained model for testing; the crowd prediction density maps are output and the counting results are computed.

Claims (5)

1. A method for crowd counting and future people flow prediction based on video images, characterized by comprising the following steps:
step 1, selecting a video image data set with manual annotation information, and applying Gaussian-function processing according to the annotated head positions in the images to generate a ground-truth density map corresponding to each original image;
step 2, building a multi-scale pyramid dilated convolution network model, inputting consecutive video frames of the video image data set into the model, fully extracting feature maps with multi-scale information, mapping the feature maps to a crowd estimation density map, and integrating over the crowd estimation density map to count the number of people;
the multi-scale pyramid dilated convolution network model is divided into two parts: the first part is a VGG-Basic network; the second part consists of four parallel dilated convolution layers with different dilation rates; the feature map output by each branch is concatenated along the channel dimension with the feature map output by the VGG-Basic network;
step 3, constructing a Bi-ConvLSTM network based on residual connections; stacking the obtained crowd estimation density maps and inputting them into the Bi-ConvLSTM network to predict the crowd prediction density map at time T+1 and estimate the number of pedestrians at time T+1;
the bidirectional ConvLSTM module is an improvement on the conventional ConvLSTM: the input crowd estimation density maps are processed by two ConvLSTM units running in the forward and backward directions and superimposed, so the output contains both forward and backward sequence information;
step 4, pre-training the multi-scale pyramid dilated convolution network model of step 2, saving the model parameters, storing the obtained crowd estimation density maps, and inputting them into the residual-connected Bi-ConvLSTM network of step 3 for training; optimizing the parameters of the multi-scale pyramid dilated convolution network model and the Bi-ConvLSTM network with a stochastic gradient descent algorithm, and computing the loss between the crowd prediction density map and the ground-truth density map with the Euclidean distance.
2. The method of claim 1, wherein step 1 comprises the following steps:
converting the head-position annotations in the input video image data set into a ground-truth density map with a two-dimensional Gaussian convolution kernel, and using it as the ground truth of the training set for computing the loss;
so that the ground-truth density map corresponds well to dense crowd images at different viewing angles, a ground-truth density map based on a geometry-adaptive Gaussian kernel is selected, expressed by the following formula:

F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \quad \sigma_i = \beta \bar{d}_i

that is, the ground-truth density map is obtained by convolving a delta impulse function with a Gaussian kernel and then summing; x_i denotes the position of the i-th head in the image, i.e. the pixel coordinates of the i-th head, \delta(x - x_i) is the impulse function for the head position in the image, and N is the total number of heads in the image;

\bar{d}_i = \frac{1}{m} \sum_{j=1}^{m} d_i^j

is the average distance from head position x_i to its m nearest head positions, and d_i^j denotes the distance from head position x_i to head position x_j; experiments show that \beta = 0.3 gives the best results.
3. The method for crowd counting and future people flow prediction based on video images as claimed in claim 1 or 2, wherein the specific steps of step 2 are:
generating estimated crowd density maps from consecutive video images of the video image data set through the multi-scale pyramid dilated convolution network model;
the multi-scale pyramid dilated convolution network model is divided into two parts:
the first part is VGG-Basic, which takes the VGG-16 network as its backbone, keeps only the first 10 convolutional layers and the first 3 max-pooling layers, and removes all remaining layers;
the second part consists of four parallel dilated convolution layers with different dilation rates, producing four feature maps with multi-scale information from different receptive fields; the dilation rates are set in sequence to r = 2^l, l = 1, 2, 3, 4, each dilated branch contains 5 convolutional layers, and the kernel size is set to 3×3; the feature maps output by the four branches and the feature map output by VGG-Basic are then concatenated along the channel dimension, a 1×1 convolutional layer performs feature dimension reduction and maps the result to a crowd estimation density map (DE), and the count is obtained by integrating over the crowd estimation density map (DE).
4. The method of claim 3, wherein step 3 comprises the following steps:
for density map prediction over consecutive video images, a Bi-ConvLSTM network based on residual connections is proposed; the crowd estimation density maps obtained in step 2 are input into the residual-connected Bi-ConvLSTM network for reconstruction and prediction, the sequence of crowd estimation density maps at the consecutive times {T−t, ..., T−1, T} being input into the network;
the residual-connected Bi-ConvLSTM network uses ConvLSTM as its basic structure, replacing ConvLSTM with a bidirectional ConvLSTM structure; the input crowd estimation density maps are processed by two ConvLSTM units running in the forward and backward directions and superimposed, and the output feature maps contain both forward and backward sequence information, used to reconstruct the crowd estimation density map sequence and to predict the future video frame sequence;
the spatio-temporal sequence prediction problem is to predict the most likely K future video frames from the previous J training video frames:

\tilde{F}_m = \arg\max_{F_m} p(F_m \mid F_p)

where the previous frames are F_p = {X_{t-J+1}, ..., X_t}, the most likely future frames are F_m = {X_{t+1}, X_{t+2}, ..., X_{t+K}}, t denotes the current time, J the number of previous frames, K the number of predicted frames, and σ(·) is the softmax function;
the Bi-ConvLSTM network consists of bidirectional ConvLSTM, BN, ReLU activation functions, and a residual connection structure; finally, convolution followed by ReLU activation yields the crowd prediction density map at time T+1, over which the pedestrian flow at time T+1 is obtained by integration.
5. The method according to claim 3 or 4, wherein step 4 comprises:
training process: pre-training the multi-scale pyramid dilated convolution network model of step 2, saving the model parameters, storing the obtained crowd estimation density maps as the input of step 3, and inputting them into the residual-connected Bi-ConvLSTM network for training;
computing the loss between the crowd prediction density map and the ground-truth density map with the Euclidean distance, and optimizing the parameters with a stochastic gradient descent algorithm until the loss value converges; with the Euclidean distance measuring the distance between the crowd prediction density map and the ground-truth density map, the loss function is defined as:

L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \| Z(X_i; \Theta) - Z_i^{GT} \|_2^2

where N is the number of pictures input to the multi-scale pyramid dilated convolution network model, X_i is the i-th input picture, Z(X_i; \Theta) is the crowd density estimation map for the i-th input picture, Z_i^{GT} is the ground-truth density map for the i-th input picture, and \Theta denotes the network parameters to be learned;
when evaluating the crowd prediction density map, the commonly used mean squared error MSE and mean absolute error MAE are adopted, specifically:

MAE = \frac{1}{N} \sum_{i=1}^{N} | C_i - C_i^{GT} |

MSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} ( C_i - C_i^{GT} )^2 }

where N is the number of pictures input to the multi-scale pyramid dilated convolution network model, C_i is the predicted count from the crowd prediction density map for the i-th input picture, and C_i^{GT} is the true count for the i-th input picture;
testing process: a new data set of consecutive video frames is selected and input into the trained model for testing, the crowd prediction density maps are output, and the counting results are computed.
CN202010364590.1A 2020-04-30 2020-04-30 Method for crowd counting and future people flow prediction based on video image Active CN111611878B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010364590.1A CN111611878B (en) 2020-04-30 2020-04-30 Method for crowd counting and future people flow prediction based on video image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010364590.1A CN111611878B (en) 2020-04-30 2020-04-30 Method for crowd counting and future people flow prediction based on video image

Publications (2)

Publication Number Publication Date
CN111611878A true CN111611878A (en) 2020-09-01
CN111611878B CN111611878B (en) 2022-07-22

Family

ID=72203064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010364590.1A Active CN111611878B (en) 2020-04-30 2020-04-30 Method for crowd counting and future people flow prediction based on video image

Country Status (1)

Country Link
CN (1) CN111611878B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200118423A1 (en) * 2017-04-05 2020-04-16 Carnegie Mellon University Deep Learning Methods For Estimating Density and/or Flow of Objects, and Related Methods and Software
CN107862261A (en) * 2017-10-25 2018-03-30 天津大学 Image people counting method based on multiple dimensioned convolutional neural networks
CN108388852A (en) * 2018-02-09 2018-08-10 北京天元创新科技有限公司 A kind of region crowd density prediction technique and device based on deep learning
US20190347476A1 (en) * 2018-05-09 2019-11-14 Korea Advanced Institute Of Science And Technology Method for estimating human emotions using deep psychological affect network and system therefor
CN108615027A (en) * 2018-05-11 2018-10-02 常州大学 A method of video crowd is counted based on shot and long term memory-Weighted Neural Network
CN109558862A (en) * 2018-06-15 2019-04-02 广州深域信息科技有限公司 The people counting method and system of attention refinement frame based on spatial perception
CN109101930A (en) * 2018-08-18 2018-12-28 华中科技大学 A kind of people counting method and system
CN109460855A (en) * 2018-09-29 2019-03-12 中山大学 A kind of throughput of crowded groups prediction model and method based on focus mechanism
CN109815867A (en) * 2019-01-14 2019-05-28 东华大学 A kind of crowd density estimation and people flow rate statistical method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FENG XIONG 等: ""Spatiotemporal Modeling for Crowd Counting in Videos"", 《ARXIV》 *
SHANGHANG ZHANG 等: ""FCN-rLSTM: Deep Spatio-Temporal Neural Networks for Vehicle Counting in City Cameras"", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)》 *
YANYAN FANG 等: ""LOCALITY-CONSTRAINED SPATIAL TRANSFORMER NETWORK FOR VIDEO CROWD COUNTING"", 《ARXIV》 *
刘旭 (LIU Xu): "Research on Object Counting Methods in Video Surveillance", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215129A (en) * 2020-10-10 2021-01-12 江南大学 Crowd counting method and system based on sequencing loss and double-branch network
US20220138475A1 (en) * 2020-11-04 2022-05-05 Tahmid Z CHOWDHURY Methods and systems for crowd motion summarization via tracklet based human localization
US11348338B2 (en) * 2020-11-04 2022-05-31 Huawei Technologies Co., Ltd. Methods and systems for crowd motion summarization via tracklet based human localization
CN112380960A (en) * 2020-11-11 2021-02-19 广东电力信息科技有限公司 Crowd counting method, device, equipment and storage medium
WO2022106556A1 (en) * 2020-11-18 2022-05-27 Thales Method for determining a density of elements in regions of an environment, and associated computer program product
FR3116361A1 (en) * 2020-11-18 2022-05-20 Thales Method for determining a density of elements in areas of an environment, associated computer program product
CN112418120A (en) * 2020-11-27 2021-02-26 湖南师范大学 Crowd detection method based on peak confidence map
CN112418120B (en) * 2020-11-27 2021-09-28 湖南师范大学 Crowd detection method based on peak confidence map
CN112541891A (en) * 2020-12-08 2021-03-23 山东师范大学 Crowd counting method and system based on void convolution high-resolution network
CN112633106A (en) * 2020-12-16 2021-04-09 苏州玖合智能科技有限公司 Crowd characteristic recognition network construction and training method suitable for large depth of field
CN112767451A (en) * 2021-02-01 2021-05-07 福州大学 Crowd distribution prediction method and system based on double-current convolutional neural network
CN112767451B (en) * 2021-02-01 2022-09-06 福州大学 Crowd distribution prediction method and system based on double-current convolutional neural network
CN112861697A (en) * 2021-02-03 2021-05-28 同济大学 Crowd counting method and device based on picture self-symmetry crowd counting network
CN112861697B (en) * 2021-02-03 2022-10-25 同济大学 Crowd counting method and device based on picture self-symmetry crowd counting network
CN113191301B (en) * 2021-05-14 2023-04-18 上海交通大学 Video dense crowd counting method and system integrating time sequence and spatial information
CN113191301A (en) * 2021-05-14 2021-07-30 上海交通大学 Video dense crowd counting method and system integrating time sequence and spatial information
CN113343790A (en) * 2021-05-21 2021-09-03 中车唐山机车车辆有限公司 Traffic hub passenger flow statistical method, device and storage medium
CN113920733A (en) * 2021-10-14 2022-01-11 齐鲁工业大学 Traffic volume estimation method and system based on deep network
CN114120233B (en) * 2021-11-29 2024-04-16 上海应用技术大学 Training method of lightweight pyramid cavity convolution aggregation network for crowd counting
CN114120233A (en) * 2021-11-29 2022-03-01 上海应用技术大学 Training method of lightweight pyramid hole convolution aggregation network for crowd counting
CN114154620A (en) * 2021-11-29 2022-03-08 上海应用技术大学 Training method of crowd counting network
CN114499941B (en) * 2021-12-22 2023-08-04 天翼云科技有限公司 Training and detecting method of flow detection model and electronic equipment
CN114499941A (en) * 2021-12-22 2022-05-13 天翼云科技有限公司 Training and detecting method of flow detection model and electronic equipment
CN114543312A (en) * 2022-02-08 2022-05-27 珠海格力电器股份有限公司 Fresh air equipment control method and device, computer equipment and medium
CN117058627A (en) * 2023-10-13 2023-11-14 阳光学院 Public place crowd safety distance monitoring method, medium and system
CN117058627B (en) * 2023-10-13 2023-12-26 阳光学院 Public place crowd safety distance monitoring method, medium and system

Also Published As

Publication number Publication date
CN111611878B (en) 2022-07-22

Similar Documents

Publication Publication Date Title
CN111611878B (en) Method for crowd counting and future people flow prediction based on video image
CN110781838B (en) Multi-mode track prediction method for pedestrians in complex scene
CN108805083A (en) The video behavior detection method of single phase
CN111476181B (en) Human skeleton action recognition method
CN111563447B (en) Crowd density analysis and detection positioning method based on density map
CN110852267B (en) Crowd density estimation method and device based on optical flow fusion type deep neural network
CN110147743A (en) Real-time online pedestrian analysis and number system and method under a kind of complex scene
CN111144329B (en) Multi-label-based lightweight rapid crowd counting method
CN109858424A (en) Crowd density statistical method, device, electronic equipment and storage medium
CN111191667B (en) Crowd counting method based on multiscale generation countermeasure network
CN111783589B (en) Complex scene crowd counting method based on scene classification and multi-scale feature fusion
CN106815563B (en) Human body apparent structure-based crowd quantity prediction method
CN110059616A (en) Pedestrian's weight identification model optimization method based on fusion loss function
CN112001278A (en) Crowd counting model based on structured knowledge distillation and method thereof
CN110991317A (en) Crowd counting method based on multi-scale perspective sensing type network
CN113139489A (en) Crowd counting method and system based on background extraction and multi-scale fusion network
Wang et al. Edge computing-enabled crowd density estimation based on lightweight convolutional neural network
CN113239904B (en) High-resolution dense target counting method based on convolutional neural network
CN114187506A (en) Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network
CN116170746B (en) Ultra-wideband indoor positioning method based on depth attention mechanism and geometric information
CN112115786A (en) Monocular vision odometer method based on attention U-net
CN113887536B (en) Multi-stage efficient crowd density estimation method based on high-level semantic guidance
CN115965905A (en) Crowd counting method and system based on multi-scale fusion convolutional network
CN115457464A (en) Crowd counting method based on transformer and CNN
CN114445765A (en) Crowd counting and density estimating method based on coding and decoding structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant