CN111611878A - Method for crowd counting and future people flow prediction based on video image - Google Patents
- Publication number: CN111611878A
- Application number: CN202010364590.1A
- Authority
- CN
- China
- Prior art keywords
- crowd
- density
- convlstm
- density map
- prediction
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V20/52 — Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53 — Recognition of crowd images, e.g. recognition of crowd congestion
- G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06N3/045 — Combinations of networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/049 — Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/084 — Backpropagation, e.g. using gradient descent
Abstract
The invention discloses a method for crowd counting and future people-flow prediction based on video images. The method comprises the following steps: 1. select a video image data set with annotation information and apply Gaussian-function processing to the annotated head positions to generate a ground-truth density map; 2. input the video frames into the constructed multi-scale pyramid dilated convolution (MPDC) model to extract feature maps and map them to a crowd estimation density map (DE); 3. input a stack of the obtained DE frames into the constructed Bi-ConvLSTM network, predict the crowd prediction density map at time T+1, and estimate the number of pedestrians at time T+1. The method adopts a multi-scale pyramid dilated convolution network and a Bi-ConvLSTM network based on residual connections, uses consecutive video frames to generate a crowd estimation density map, further predicts the density map of future frames, and counts the number of people. Aimed at prediction over consecutive video images, the invention is a brand-new method: it obtains not only a real-time crowd density map and head count but also the crowd density map and people flow of future frames.
Description
Technical Field
The invention belongs to the field of crowd image processing in computer vision, and particularly relates to a method for crowd counting and future people flow prediction based on a video image.
Background
Crowd counting means counting the number of people in a picture or a video sequence. Crowd counting and forecasting matter for public safety management, regional space planning, information resource acquisition, and similar applications: they help monitor and channel crowds in public places, and provide a basis for reasonable scheduling of personnel, route planning, diversion of people flows, and building site selection.
Existing crowd counting methods fall into three categories: methods based on detection, on regression, and on density map estimation. Detection-based methods suit scenes with large, sparsely distributed targets, while regressing the head count alone cannot estimate the congestion distribution and loses the spatial information of target positions.
As crowd counting technology developed, algorithms moved from simply computing the number of pedestrians to producing a crowd estimation density map, so that the head count and the density distribution are obtained at the same time. Compared with the former two approaches, deep-learning regression of density maps alleviates occlusion between people to a certain extent, and the density map exposes the specific density distribution of pedestrians.
However, in complex backgrounds and scenes where high-density crowds gather, interference factors such as occlusion caused by heavy overlap between people, perspective distortion, scale change, and uneven density distribution strongly hinder crowd counting and the recovery of density information. Existing deep-learning methods for generating density maps extract multi-scale image features with multi-column convolution layers and large convolution kernels, which produces a large number of parameters and increases the difficulty of training the network. In addition, people-flow prediction from crowd video images has received little research.
Disclosure of Invention
The invention aims to provide a method for crowd counting and future people-flow prediction based on video images, addressing the problems in the existing crowd counting field.
The technical scheme of the invention generally comprises the following steps:
Step 1, select a video image data set with annotation information, and apply Gaussian-function processing to the annotated head positions to generate a ground-truth density map.
Step 2, build a multi-scale pyramid dilated convolution network model (MPDC), input consecutive video frames, extract feature maps with multi-scale information, and map them to a crowd estimation density map (DE).
The multi-scale pyramid dilated convolution network model is divided into two parts: the first part is a VGG-Basic network; the second part consists of four parallel dilated convolution layers with different dilation rates; the feature map output by each branch is channel-concatenated with the feature map output by the VGG-Basic network.
Step 3, construct a Bi-ConvLSTM network based on residual connections. Input a stack of several frames of the obtained crowd estimation density maps (DE) into the Bi-ConvLSTM network, predict the crowd prediction density map (PE) at time T+1, and estimate the number of pedestrians at time T+1.
The bidirectional ConvLSTM module is an improvement on the traditional ConvLSTM: the input crowd estimation density map is processed by two ConvLSTM units superposed in the forward and reverse directions, so the output contains both forward and reverse sequence information.
Step 4, pre-train the multi-scale pyramid dilated convolution network model (MPDC) of step 2, save the model parameters, save the obtained crowd estimation density maps (DE), and feed them into the residual-connected Bi-ConvLSTM network of step 3 for training. A stochastic gradient descent algorithm optimizes the parameters of both networks, and the Euclidean distance is used to compute the loss between the crowd prediction density map (PE) and the ground-truth density map.
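As a minimal illustration of the stochastic-gradient-descent update named in step 4 (the learning rate below is an arbitrary placeholder, not a value from the patent):

```python
def sgd_step(params, grads, lr=0.01):
    # One plain SGD update per parameter: theta <- theta - lr * dL/dtheta.
    return [p - lr * g for p, g in zip(params, grads)]

# e.g. a single parameter 1.0 with gradient 0.5 and lr 0.1 moves to 0.95
```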
Preferably, the specific steps of step 1 are as follows:
Convert the head-position annotations in the input video image data set into a ground-truth density map with a two-dimensional Gaussian convolution kernel, used as the training set with ground truth for computing the loss.
To make the ground-truth density map correspond well to dense crowd images under different viewing angles, a geometry-adaptive Gaussian kernel is used:

$$D(x)=\sum_{i=1}^{N}\delta(x-x_i)*G_{\sigma_i}(x),\qquad \sigma_i=\beta\,\bar d_i$$

The ground-truth density map is obtained by convolving the impulse (delta) function at each head position with a Gaussian and then summing. Here $x_i$ is the position of the $i$-th head in the image, i.e. its pixel coordinates, $\delta(x-x_i)$ is the impulse function at that position, and $N$ is the total number of heads in the image.
$\bar d_i=\frac{1}{m}\sum_{j=1}^{m}d_i^{\,j}$ is the average distance from head position $x_i$ to its $m$ nearest head positions, where $d_i^{\,j}$ denotes the distance from $x_i$ to $x_j$; experiments show that $\beta=0.3$ gives the best effect.
The processing of step 1 converts the original image with head annotation in the video image data set into a true density map.
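The step-1 conversion can be sketched in NumPy as follows. `density_map` and its parameters are illustrative names, the border handling is a simplification, and the geometry-adaptive σ follows the formula above with β = 0.3:

```python
import numpy as np

def gaussian_kernel(size, sigma):
    # Normalized 2-D Gaussian kernel of odd side length `size` (sums to 1).
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def density_map(shape, heads, beta=0.3, m=3):
    # heads: list of (row, col) head coordinates for one image.
    dmap = np.zeros(shape, dtype=np.float64)
    heads = np.asarray(heads, dtype=np.float64)
    for i, (r, c) in enumerate(heads):
        # Geometry-adaptive sigma: beta times mean distance to m nearest heads.
        d = np.sqrt(((heads - heads[i]) ** 2).sum(axis=1))
        nearest = np.sort(d)[1:m + 1]          # exclude self
        sigma = beta * nearest.mean() if len(nearest) else 4.0
        size = max(3, int(2 * np.ceil(3 * sigma) + 1))
        k = gaussian_kernel(size, sigma)
        # Paste the unit-mass kernel onto the map, clipping at the borders.
        r0, c0 = int(r) - size // 2, int(c) - size // 2
        for dr in range(size):
            for dc in range(size):
                rr, cc = r0 + dr, c0 + dc
                if 0 <= rr < shape[0] and 0 <= cc < shape[1]:
                    dmap[rr, cc] += k[dr, dc]
    return dmap
```

Because each pasted kernel has unit mass, summing the map recovers approximately the head count whenever the kernels are not clipped at the image borders.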
Preferably, the specific steps of step 2 are:
and generating an estimated crowd density map by continuous video images in the video image data set through a multi-scale pyramid cavity convolution network model.
The multi-scale pyramid cavity convolution network model is divided into two parts:
the first part is VGG-Basic, which takes a VGG-16 network as a Basic framework, only the first 10 convolutional layers and the first 3 maximum pooling layers are reserved, and the rest layers are completely removed;
the second part is composed of four parallel cavity convolution layers with different cavity rates, and four characteristic graphs of different receptive field multi-scale information are respectively generated; wherein the voidage is sequentially set to r-2lAnd l is 1,2,3 and 4, each hole convolutional layer has 5 convolutional layers, the size of a convolutional kernel is set to be C3, then the feature maps output by the four branches and the feature map output by the VGG-Basic are spliced on a channel, 1 × 1 convolutional layer is adopted to carry out feature dimension reduction and is mapped into a crowd estimation density map (DE), and the number of people is integrated and counted on the crowd estimation density map (DE).
Preferably, the specific steps of step 3 are:
for the density map prediction of consecutive video images, a Bi-ConvLSTM network based on residual concatenation is proposed. And (3) inputting the crowd estimated density map (DE) obtained in the step (2) into a Bi-ConvLSTM network based on residual connection for reconstruction and prediction, and inputting the crowd estimated density map sequence at the continuous time of { T-T., T-1, T } into the Bi-ConvLSTM network based on residual connection.
The Bi-ConvLSTM network based on residual connection uses ConvLSTM as a basic structure, the ConvLSTM is replaced by a bidirectional ConvLSTM structure, an input crowd estimation density graph is calculated by forward and reverse superposition of two ConvLSTM units, and an output characteristic graph comprises forward sequence information and reverse sequence information and is used for reconstructing a crowd estimation density graph sequence and predicting a future video frame sequence.
The spatio-temporal sequence prediction problem is to predict the most likely K future video frames from the previous J training video frames:

$$F_m=\{X_{t+1},X_{t+2},\dots,X_{t+K}\}=\arg\max_{X_{t+1},\dots,X_{t+K}} p\left(X_{t+1},\dots,X_{t+K}\mid X_{t-J+1},\dots,X_{t}\right)$$

where the previous frames are $\{X_{t-J+1},\dots,X_t\}$, $F_m$ is the most likely future frame sequence, $t$ denotes the current time, $J$ the number of previous frames, $K$ the number of predicted frames, and $\sigma(\cdot)$ is the softmax function used in the network output.
The Bi-ConvLSTM network is composed of bidirectional ConvLSTM, BN, ReLU activation functions and residual connection structures. And finally, obtaining a crowd prediction density map at the T +1 moment through convolution and ReLU function activation, and performing integral statistics on the pedestrian volume at the T +1 moment.
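The "integral statistics" on a predicted density map reduces to a pixel-wise sum; a one-line NumPy sketch (function name illustrative):

```python
import numpy as np

def count_from_density(dmap):
    # The head count is the integral of the density map, i.e. the sum of
    # all pixel values, rounded afterwards if an integer count is needed.
    return float(np.asarray(dmap, dtype=np.float64).sum())
```

For example, a map holding three unit-mass Gaussians integrates to approximately 3.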
Preferably, the specific content of step 4 is:
training process: pre-training the multi-scale pyramid cavity convolution network Model (MPDC) in the step 2, storing model parameters, storing the obtained crowd estimation density map (DE) as the input of the step 3, and inputting the crowd estimation density map into a Bi-ConvLSTM network based on residual connection for training.
The loss between the crowd predicted density map and the truth density map is calculated by using Euclidean distance, and the invention adopts a random gradient descent algorithm to optimize parameters until the loss value converges to the predicted value.
The Euclidean distance measures the difference between the crowd prediction density map and the ground-truth density map. The loss function is defined as:

$$L(\Theta)=\frac{1}{2N}\sum_{i=1}^{N}\left\|Z(X_i;\Theta)-D_i^{GT}\right\|_2^2$$

where N is the number of pictures input to the multi-scale pyramid dilated convolution network model, $Z(X_i;\Theta)$ is the crowd density estimation map for the $i$-th input picture $X_i$, $D_i^{GT}$ is the ground-truth density map of the $i$-th input picture, and $\Theta$ denotes the network parameters to be learned.
The crowd prediction density map is evaluated with the commonly used mean squared error (MSE) and mean absolute error (MAE):

$$MAE=\frac{1}{N}\sum_{i=1}^{N}\left|C_i-C_i^{GT}\right|,\qquad MSE=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(C_i-C_i^{GT}\right)^2}$$

MSE describes the accuracy of the crowd prediction density map (the smaller the MSE, the higher the accuracy), while MAE reflects the magnitude of its error. N is the number of pictures input to the model, $C_i$ the predicted head count of the density map for the $i$-th input picture, and $C_i^{GT}$ the true head count of the $i$-th input picture.
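A NumPy sketch of the two evaluation metrics: here "MSE" is taken as the rooted form commonly reported in crowd counting (an assumption, since the text does not spell the formula out), and the names are illustrative:

```python
import numpy as np

def mae(pred_counts, true_counts):
    # Mean absolute error of the predicted crowd counts.
    p, t = np.asarray(pred_counts, float), np.asarray(true_counts, float)
    return float(np.mean(np.abs(p - t)))

def mse(pred_counts, true_counts):
    # Root mean squared error, the "MSE" conventionally reported
    # in crowd-counting benchmarks.
    p, t = np.asarray(pred_counts, float), np.asarray(true_counts, float)
    return float(np.sqrt(np.mean((p - t) ** 2)))
```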
The testing process comprises the following steps: and selecting a new continuous video frame data set, inputting the new continuous video frame data set into the trained model for testing, outputting a crowd prediction density graph, and counting results.
The invention has the following beneficial effects:
the method adopts a Bi-ConvLSTM network based on multi-scale pyramid hole convolution network and residual connection, uses continuous video frames to generate an estimated crowd estimation density map (DE), predicts a crowd density map (FP) of a future frame, and further predicts the flow of people. The invention aims at predicting the density map and the pedestrian volume of the crowd in the video image target, and is a brand new method. The method comprises the steps that a cavity convolution network is selected from a multi-scale pyramid cavity convolution network to replace the traditional convolution-pooling-upsampling process, the receptive field is expanded while the precision is not lost, the target is accurately positioned, four groups of parallel cavity convolutions are adopted to form a pyramid mode, and the characteristics of an image are fully extracted by utilizing the receptive fields with different sizes; by fusing the outputs of different convolution layers, the learned features have more complete representation on the image; in a Bi-ConvLSTM network based on residual connection, a density map of a future frame is predicted by utilizing the strong space-time feature extraction capability of the ConvLSTM network, the ConvLSTM is replaced by the bidirectional ConvLSTM, the input density map is calculated by the forward and reverse superposition of two ConvLSTM units, and the output contains forward sequence information and reverse sequence information. Compared with the existing crowd counting technology, the method provided by the invention is used for counting the crowd of the video image, so that not only can a real-time crowd density map and the number of people be obtained, but also the crowd density map and the flow of people of a future frame can be predicted.
Drawings
FIG. 1 is the overall flow chart of the network of the invention;
FIG. 2 is a network structure for generating a density map based on a multi-scale pyramid hole convolution network;
FIG. 3 is a network structure of a residual concatenation based Bi-ConvLSTM predicted future frame density map;
FIG. 4 is a network structure of a bidirectional ConvLSTM module according to the present invention;
FIG. 5 is a flowchart of the network model training process of the present invention.
Detailed Description
The invention provides a method for crowd counting and people-flow prediction based on video images. 1) A VGG-Basic structure performs preliminary feature extraction; it consists of a convolutional neural network (CNN) with several layers of small convolution kernels, giving strong image-feature characterization while keeping the training parameters small. 2) Dilated convolution replaces the traditional convolution-pooling-upsampling pipeline, enlarging the receptive field and localizing targets accurately without losing precision; four groups of parallel dilated convolution layers form a pyramid, and different receptive fields capture the multi-scale information of the image. 3) Fusing the outputs of different convolution layers gives the learned features a more complete representation of the image. 4) The video-frame density map at future times is predicted by a bidirectional convolutional long short-term memory network (Bi-ConvLSTM) based on residual connections: a stack of t density maps is selected as input according to the step length, the strong spatio-temporal feature extraction capability of ConvLSTM predicts the density map of the future frame, and the crowd size at the future time is predicted. Compared with existing crowd counting techniques, applying the method to video images yields not only a real-time crowd density map and head count but also the crowd density map and people flow of future frames.
As shown in fig. 1: step 1, select a video image data set with manual annotation information, and apply Gaussian-function processing to the annotated head positions to generate a ground-truth density map corresponding to each original image. Step 2, build a Multi-scale Pyramid Dilated Convolution network model (MPDC); input consecutive video frames of the data set into the MPDC model, fully extract feature maps with multi-scale information, map them to a crowd estimation density map (DE), and integrate over DE to count the number of people. Step 3, construct a bidirectional ConvLSTM network (Bi-ConvLSTM) based on residual connections; input a stack of several frames of the obtained DE into the Bi-ConvLSTM network, predict the crowd prediction density map (PE) at time T+1, and estimate the number of pedestrians at time T+1. Step 4, pre-train the MPDC model of step 2, save the model parameters, save the obtained DE, and feed it into the residual-connected Bi-ConvLSTM network of step 3 for training.
The method comprises the following specific steps:
the general steps of the step 1 are as follows:
and (3) converting the head position label in the input video image data set into a truth-value density graph by utilizing a two-dimensional Gaussian convolution kernel, and using the truth-value density graph as a training set with truth values for calculating loss difference. In order to make the truth density map better correspond to the dense crowd images at different viewing angles, the truth density map based on the geometric adaptive gaussian kernel is selected and represented by the following formula:
the truth density map is obtained by convolving the delta pulse function with a Gaussian function, and summing after convolution. x is the number ofiRepresenting the position of the ith individual's head in the image, i.e. the pixel coordinates of the ith individual's head in the image, (x-x)i) Pulse function representing the position of the head in an image, N1The total number of the human heads in the image;
is a distance x from the head positioniThe average distance of the nearest m head positions proves that β is 0.3 to be the best effect.Representing the position x of the human headiTo the head position xjThe distance of (c).
The above operations convert the original images with head annotations into ground-truth density maps, used as the reference training set for the convolutional neural network.
The specific steps of the step 2 are as follows:
as shown in FIG. 2, the continuous video images in the video image data set generate estimated crowd density maps through a Multiscale Pyramid hole convolutional network Model (MPDC). The Multiscale Pyramid hole convolutional network Model (MPDC) is divided into two parts, the first part is VGG-Basic, a VGG-16 network is used as a Basic framework, only the first 10 convolutional layers and the first 3 maximum pooling layers are reserved, the rest layers are all removed, the sizes of convolutional cores are all set to be 3 × 3, the number of channels is sequentially set to be 64, 128, 256, 512 and 512, the pooling size is set to be 2 × 2, the structure of the second part is composed of four groups of parallel convolutional holes with different hole rates, feature maps for respectively generating 4 different receptive field multi-scale information are sequentially set, and the hole rate maps are sequentially setIs r is 2lAnd (l is 1,2,3 and 4), each hollow convolutional layer comprises 5 convolutional layers, the size of the core is set to be C is 3, the number of channels is set to be 512, 256 and 128 in sequence, then, the four groups of output feature maps and the feature map output by the VGG-Basic are spliced on the channels, 1 × 1 convolutional layer is adopted to carry out feature dimension reduction and is mapped into a crowd estimation density map (DE), and the crowd estimation density map (DE) is subjected to integral statistics on the number of real-time people.
The specific steps of the step 3 are as follows:
as shown in fig. 3, a Bi-ConvLSTM network based on residual concatenation is proposed for the density map prediction of consecutive video images. And (3) inputting the crowd estimated density map (DE) obtained in the step (2) into a Bi-ConvLSTM network based on residual connection for reconstruction and prediction, and inputting the crowd estimated density map sequence at the continuous time of { T-T., T-1, T } into the Bi-ConvLSTM network based on residual connection. The Bi-ConvLSTM network based on residual connection uses ConvLSTM as a basic structure, the ConvLSTM is replaced by a bidirectional ConvLSTM structure, an input crowd estimation density graph is calculated by forward and reverse superposition of two ConvLSTM units, and the output crowd estimation density graph comprises forward sequence information and reverse sequence information, is used for reconstructing a crowd estimation density graph sequence and predicting a future video frame sequence. For example, in data processing, 9 frames of pictures are input, the 5 th frame is predicted from 1 to 4 frames, the 5 th frame is predicted from 9 to 6 frames, and the results of the prediction are combined to obtain the result of the predicted final 5 th frame. The spatio-temporal sequence prediction problem is to predict the most likely K video frame sequences in the future from the previous J training video frames,
wherein the previous frameMost likely future frame Fm={Xt+1,Xt+2...,Xt+KPredicting future framest representsAt the current moment, J denotes the number of previous frames, K denotes the number of predicted frames, and σ () is a softmax function.
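The bidirectional protocol (predicting the same frame once forward and once in reverse, then combining) can be sketched as follows; element-wise averaging is one plausible fusion used purely for illustration, since the patent does not specify the exact combination rule:

```python
import numpy as np

def fuse_bidirectional(forward_pred, backward_pred):
    # Combine a forward-direction and a reverse-direction estimate of the
    # same frame (e.g. frame 5 predicted from frames 1-4 and from 9-6).
    f = np.asarray(forward_pred, dtype=np.float64)
    b = np.asarray(backward_pred, dtype=np.float64)
    return 0.5 * (f + b)
```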
The Bi-ConvLSTM network is composed of bidirectional ConvLSTM, BN, ReLU activation functions and residual connection structures. And finally, obtaining a crowd prediction density map (FP) at the T +1 moment through convolution and ReLU function activation, and carrying out integral statistics on the flow of people at the T +1 moment.
The specific content of the step 4 is as follows:
as shown in fig. 4, the training process: pre-training the multi-scale pyramid cavity convolution network Model (MPDC) in the step 2, storing model parameters, storing the obtained crowd estimation density map (DE) as the input of the step 3, and inputting the crowd estimation density map into a Bi-ConvLSTM network based on residual connection for training. The loss between the population predicted density map (FP) and the true density map (GT) is calculated using Euclidean distances, and the present invention employs a random gradient descent algorithm to optimize parameters until the loss value converges to the predicted value.
The Euclidean distance measures the difference between the crowd prediction density map and the ground-truth density map. The loss function is defined as:

$$L(\Theta)=\frac{1}{2N}\sum_{i=1}^{N}\left\|Z(X_i;\Theta)-D_i^{GT}\right\|_2^2$$

where N is the number of pictures input to the multi-scale pyramid dilated convolution network model, $Z(X_i;\Theta)$ is the crowd density estimation map for the $i$-th input picture $X_i$, $D_i^{GT}$ is the ground-truth density map of the $i$-th input picture, and $\Theta$ denotes the network parameters to be learned.
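The Euclidean loss can be sketched in NumPy over a batch of N density maps; the function name is illustrative, and the conventional 1/(2N) normalization is assumed:

```python
import numpy as np

def euclidean_loss(pred_maps, gt_maps):
    # L(Theta) = 1/(2N) * sum_i || Z(X_i; Theta) - D_i^GT ||_2^2
    p = np.asarray(pred_maps, dtype=np.float64)
    g = np.asarray(gt_maps, dtype=np.float64)
    n = p.shape[0]  # batch size N
    return float(((p - g) ** 2).sum() / (2.0 * n))
```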
The crowd prediction density map is evaluated with the commonly used mean squared error (MSE) and mean absolute error (MAE):

$$MAE=\frac{1}{N}\sum_{i=1}^{N}\left|C_i-C_i^{GT}\right|,\qquad MSE=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(C_i-C_i^{GT}\right)^2}$$

MSE describes the accuracy of the crowd prediction density map (the smaller the MSE, the higher the accuracy), while MAE reflects the magnitude of its error. N is the number of pictures input to the model, $C_i$ the predicted head count of the density map for the $i$-th input picture, and $C_i^{GT}$ the true head count of the $i$-th input picture.
The testing process comprises the following steps: and selecting a new continuous video frame data set, inputting the new continuous video frame data set into the trained model for testing, outputting a crowd prediction density graph, and counting results.
Claims (5)
1. A method for crowd counting and future people flow prediction based on video images is characterized by comprising the following steps:
step 1, selecting a video image data set with manual annotation information, and applying Gaussian-function processing to the annotated head positions in the images to generate a ground-truth density map corresponding to each original image;
step 2, building a multi-scale pyramid dilated convolution network model, inputting consecutive video frames of the video image data set into it, fully extracting feature maps with multi-scale information, mapping them to a crowd estimation density map, and integrating over the crowd estimation density map to count the number of people;
the multi-scale pyramid dilated convolution network model is divided into two parts: the first part is a VGG-Basic network; the second part consists of four parallel dilated convolution layers with different dilation rates; the feature map output by each branch is channel-concatenated with the feature map output by the VGG-Basic network;
step 3, constructing a Bi-ConvLSTM network based on residual connections; inputting a stack of the obtained crowd estimation density maps into the Bi-ConvLSTM network to predict the crowd prediction density map at time T+1 and estimate the number of pedestrians at time T+1;
the bidirectional ConvLSTM module is an improvement on the traditional ConvLSTM: the input crowd estimation density map is processed by two ConvLSTM units superposed in the forward and reverse directions, and the output contains both forward and reverse sequence information;
step 4, pre-training the multi-scale pyramid dilated convolution network model of step 2, saving the model parameters, saving the obtained crowd estimation density maps, and feeding them into the residual-connected Bi-ConvLSTM network of step 3 for training; a stochastic gradient descent algorithm optimizes the parameters of the multi-scale pyramid dilated convolution network model and the Bi-ConvLSTM network, and the Euclidean distance computes the loss between the crowd prediction density map and the ground-truth density map.
2. The method of claim 1, wherein the step 1 comprises the following steps:
the method comprises the steps that a two-dimensional Gaussian convolution kernel is utilized to convert head position labels in an input video image data set into a true value density graph, and the true value density graph is used as a training set with a true value for calculating loss difference;
in order to better correspond the truth density map to the dense crowd images at different viewing angles, the truth density map based on the geometric adaptive gaussian kernel is selected and represented by the following formula:
the truth value density chart is obtained by convolution of a delta pulse function and a Gaussian function, and the convolution is performed firstly and then the summation is performed; x is the number ofiRepresenting the position of the ith individual's head in the image, i.e. the pixel coordinates of the ith individual's head in the image, (x-x)i) Pulse function representing the position of the head in an image, N1The total number of the human heads in the image;
3. The method for crowd counting and future people flow prediction based on video images as claimed in claim 1 or 2, wherein the specific steps of the step 2 are:
consecutive video images from the video image dataset are passed through the multi-scale pyramid dilated convolution network model to generate estimated crowd density maps;
the multi-scale pyramid dilated convolution network model is divided into two parts:
the first part is VGG-Basic, which takes the VGG-16 network as its basic framework, retains only the first 10 convolutional layers and the first 3 max-pooling layers, and removes all remaining layers;
the second part consists of four parallel dilated convolution branches with different dilation rates, which generate four feature maps carrying multi-scale information from different receptive fields; the dilation rates are set to r = 2^l, l = 1, 2, 3, 4, each dilated branch comprises 5 convolutional layers, and the convolution kernel size is set to 3 × 3; the feature maps output by the four branches and the feature map output by VGG-Basic are then concatenated along the channel dimension, a 1 × 1 convolutional layer is applied for feature dimensionality reduction and mapping into the crowd-estimated density map (DE), and integral statistics are computed over the crowd-estimated density map (DE) to obtain the count.
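The two-part network of this claim can be sketched in PyTorch as below. The branch width (256 channels) is an assumption, as is reading the kernel-size specification as 3 × 3; the backbone layer counts, the dilation rates r = 2^l, the five convolutions per branch, the channel-wise concatenation and the 1 × 1 fusion follow the claim.

```python
import torch
import torch.nn as nn

def vgg_basic():
    """First 10 conv layers + first 3 max-pool layers of VGG-16."""
    cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512]
    layers, c_in = [], 3
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(2))
        else:
            layers += [nn.Conv2d(c_in, v, 3, padding=1), nn.ReLU(inplace=True)]
            c_in = v
    return nn.Sequential(*layers)

class PyramidDilatedNet(nn.Module):
    def __init__(self, branch_channels=256, convs_per_branch=5):
        super().__init__()
        self.backbone = vgg_basic()                 # outputs 512 channels
        self.branches = nn.ModuleList()
        for l in (1, 2, 3, 4):                      # dilation r = 2**l
            r = 2 ** l
            mods, c_in = [], 512
            for _ in range(convs_per_branch):       # dilated 3x3 convolutions
                mods += [nn.Conv2d(c_in, branch_channels, 3,
                                   padding=r, dilation=r),
                         nn.ReLU(inplace=True)]
                c_in = branch_channels
            self.branches.append(nn.Sequential(*mods))
        # 4 branch maps + the backbone map, fused by a 1x1 conv to 1 channel
        self.fuse = nn.Conv2d(4 * branch_channels + 512, 1, 1)

    def forward(self, x):
        f = self.backbone(x)
        maps = [b(f) for b in self.branches] + [f]
        de = self.fuse(torch.cat(maps, dim=1))      # density map DE
        return de, de.sum(dim=(1, 2, 3))            # integral = crowd count
```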
4. The method of claim 3, wherein the step 3 comprises the following steps:
aiming at density-map prediction for consecutive video images, a Bi-ConvLSTM network based on residual connections is proposed; the crowd-estimated density maps obtained in step 2 are input into the residual-connected Bi-ConvLSTM network for reconstruction and prediction, a sequence of crowd-estimated density maps at J consecutive times {T−J+1, ..., T−1, T} being fed into the network;
the residual-connected Bi-ConvLSTM network uses ConvLSTM as its basic structure, with each ConvLSTM replaced by a bidirectional ConvLSTM structure: the input crowd-estimated density maps are processed by two ConvLSTM units running forward and in reverse and their outputs are superposed, so the output feature maps contain both forward and reverse sequence information and are used to reconstruct the crowd-estimated density-map sequence and to predict the future video-frame sequence;
the spatio-temporal sequence prediction problem is to predict the most likely K future video frames from the preceding J training video frames:
given the previous frames F_p = {X_{t−J+1}, ..., X_{t−1}, X_t}, the most likely future frames F_m = {X_{t+1}, X_{t+2}, ..., X_{t+K}} are predicted as F_m = argmax_{X_{t+1},...,X_{t+K}} σ(X_{t+1}, ..., X_{t+K} | X_{t−J+1}, ..., X_t), where t denotes the current time, J the number of previous frames, K the number of predicted frames, and σ(·) the softmax function;
the Bi-ConvLSTM network is composed of bidirectional ConvLSTM units, batch normalization (BN), ReLU activation functions and a residual connection structure; finally, a convolution followed by ReLU activation yields the predicted crowd density map at time T+1, and integral statistics over this map give the pedestrian flow at time T+1.
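A minimal sketch of the bidirectional ConvLSTM building block: two ConvLSTM cells traverse the density-map sequence forward and in reverse, and their hidden sequences are superposed. The gate layout and hidden width are illustrative assumptions; the BN layers, residual connection and final prediction convolution of the full network are omitted for brevity.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: all four gates computed by one convolution
    over the concatenated input and hidden state."""
    def __init__(self, c_in, c_hid, k=3):
        super().__init__()
        self.c_hid = c_hid
        self.gates = nn.Conv2d(c_in + c_hid, 4 * c_hid, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], 1)), 4, 1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c = f * c + i * g
        h = o * c.tanh()
        return h, c

class BiConvLSTM(nn.Module):
    """Runs one ConvLSTM over the sequence forward and another in reverse,
    then superposes (concatenates) the two hidden-state sequences."""
    def __init__(self, c_in, c_hid):
        super().__init__()
        self.fwd = ConvLSTMCell(c_in, c_hid)
        self.bwd = ConvLSTMCell(c_in, c_hid)
        self.c_hid = c_hid

    def _run(self, cell, seq):
        b, _, _, hgt, wid = seq.shape
        h = seq.new_zeros(b, self.c_hid, hgt, wid)
        c = torch.zeros_like(h)
        out = []
        for t in range(seq.shape[1]):
            h, c = cell(seq[:, t], (h, c))
            out.append(h)
        return torch.stack(out, dim=1)

    def forward(self, seq):                  # seq: (B, T, C, H, W)
        fwd = self._run(self.fwd, seq)
        bwd = self._run(self.bwd, seq.flip(1)).flip(1)
        return torch.cat([fwd, bwd], dim=2)  # forward + reverse information
```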
5. The method according to claim 3 or 4, wherein the step 4 comprises:
training process: pre-train the multi-scale pyramid dilated convolution network model of step 2 and save the model parameters; save the resulting crowd-estimated density maps and, as the input of step 3, feed them into the residual-connected Bi-ConvLSTM network for training;
the loss between the predicted crowd density map and the ground-truth density map is computed as the Euclidean distance, and the parameters are optimized with stochastic gradient descent until the loss value converges; measuring the distance between the predicted crowd density map and the ground-truth density map by the Euclidean distance, the loss function is defined as
L(Θ) = 1/(2N) Σ_{i=1}^{N} ||Z(X_i; Θ) − Z_i^GT||₂²,
where N denotes the number of pictures input into the multi-scale pyramid dilated convolution network model, Z(X_i; Θ) is the estimated crowd density map for the i-th input picture, X_i denotes the i-th input picture, Z_i^GT denotes the ground-truth density map of the i-th input picture, and Θ denotes the network parameters to be learned;
when evaluating the predicted crowd density map, the commonly used mean squared error (MSE) and mean absolute error (MAE) are adopted:
MAE = 1/N Σ_{i=1}^{N} |C_i − C_i^GT|,  MSE = √( 1/N Σ_{i=1}^{N} (C_i − C_i^GT)² ),
where N denotes the number of pictures input into the multi-scale pyramid dilated convolution network model, C_i denotes the predicted count from the predicted crowd density map of the i-th input picture, and C_i^GT denotes the true count for the i-th input picture;
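The loss and the evaluation metrics above are straightforward to express in NumPy; a sketch (function names are illustrative; note that MSE here, as is customary in crowd counting, is the root of the mean squared count error):

```python
import numpy as np

def euclidean_loss(est_maps, gt_maps):
    """L(theta) = 1/(2N) * sum_i ||Z(X_i; theta) - Z_i^GT||_2^2 over N maps."""
    est, gt = np.asarray(est_maps, float), np.asarray(gt_maps, float)
    n = est.shape[0]
    return ((est - gt) ** 2).sum() / (2 * n)

def mae_mse(pred_counts, true_counts):
    """Counting metrics: MAE = mean |C_i - C_i^GT|,
    MSE = sqrt(mean (C_i - C_i^GT)^2)."""
    d = np.asarray(pred_counts, float) - np.asarray(true_counts, float)
    return np.abs(d).mean(), np.sqrt((d ** 2).mean())
```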
testing process: a new dataset of consecutive video frames is selected and input into the trained model for testing; the model outputs the predicted crowd density maps and the counting results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010364590.1A CN111611878B (en) | 2020-04-30 | 2020-04-30 | Method for crowd counting and future people flow prediction based on video image |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111611878A true CN111611878A (en) | 2020-09-01 |
CN111611878B CN111611878B (en) | 2022-07-22 |
Family
ID=72203064
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107862261A (en) * | 2017-10-25 | 2018-03-30 | 天津大学 | Image people counting method based on multiple dimensioned convolutional neural networks |
CN108388852A (en) * | 2018-02-09 | 2018-08-10 | 北京天元创新科技有限公司 | A kind of region crowd density prediction technique and device based on deep learning |
CN108615027A (en) * | 2018-05-11 | 2018-10-02 | 常州大学 | A method of video crowd is counted based on shot and long term memory-Weighted Neural Network |
CN109101930A (en) * | 2018-08-18 | 2018-12-28 | 华中科技大学 | A kind of people counting method and system |
CN109460855A (en) * | 2018-09-29 | 2019-03-12 | 中山大学 | A kind of throughput of crowded groups prediction model and method based on focus mechanism |
CN109558862A (en) * | 2018-06-15 | 2019-04-02 | 广州深域信息科技有限公司 | The people counting method and system of attention refinement frame based on spatial perception |
CN109815867A (en) * | 2019-01-14 | 2019-05-28 | 东华大学 | A kind of crowd density estimation and people flow rate statistical method |
US20190347476A1 (en) * | 2018-05-09 | 2019-11-14 | Korea Advanced Institute Of Science And Technology | Method for estimating human emotions using deep psychological affect network and system therefor |
US20200118423A1 (en) * | 2017-04-05 | 2020-04-16 | Carnegie Mellon University | Deep Learning Methods For Estimating Density and/or Flow of Objects, and Related Methods and Software |
Non-Patent Citations (4)
Title |
---|
FENG XIONG et al.: "Spatiotemporal Modeling for Crowd Counting in Videos", arXiv * |
SHANGHANG ZHANG et al.: "FCN-rLSTM: Deep Spatio-Temporal Neural Networks for Vehicle Counting in City Cameras", 2017 IEEE International Conference on Computer Vision (ICCV) * |
YANYAN FANG et al.: "Locality-Constrained Spatial Transformer Network for Video Crowd Counting", arXiv * |
LIU XU: "Research on Target Counting Methods in Video Surveillance", China Doctoral Dissertations Full-text Database, Information Science and Technology series * |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112215129A (en) * | 2020-10-10 | 2021-01-12 | 江南大学 | Crowd counting method and system based on sequencing loss and double-branch network |
US20220138475A1 (en) * | 2020-11-04 | 2022-05-05 | Tahmid Z CHOWDHURY | Methods and systems for crowd motion summarization via tracklet based human localization |
US11348338B2 (en) * | 2020-11-04 | 2022-05-31 | Huawei Technologies Co., Ltd. | Methods and systems for crowd motion summarization via tracklet based human localization |
CN112380960A (en) * | 2020-11-11 | 2021-02-19 | 广东电力信息科技有限公司 | Crowd counting method, device, equipment and storage medium |
WO2022106556A1 (en) * | 2020-11-18 | 2022-05-27 | Thales | Method for determining a density of elements in regions of an environment, and associated computer program product |
FR3116361A1 (en) * | 2020-11-18 | 2022-05-20 | Thales | Method for determining a density of elements in areas of an environment, associated computer program product |
CN112418120A (en) * | 2020-11-27 | 2021-02-26 | 湖南师范大学 | Crowd detection method based on peak confidence map |
CN112418120B (en) * | 2020-11-27 | 2021-09-28 | 湖南师范大学 | Crowd detection method based on peak confidence map |
CN112541891A (en) * | 2020-12-08 | 2021-03-23 | 山东师范大学 | Crowd counting method and system based on void convolution high-resolution network |
CN112633106A (en) * | 2020-12-16 | 2021-04-09 | 苏州玖合智能科技有限公司 | Crowd characteristic recognition network construction and training method suitable for large depth of field |
CN112767451A (en) * | 2021-02-01 | 2021-05-07 | 福州大学 | Crowd distribution prediction method and system based on double-current convolutional neural network |
CN112767451B (en) * | 2021-02-01 | 2022-09-06 | 福州大学 | Crowd distribution prediction method and system based on double-current convolutional neural network |
CN112861697A (en) * | 2021-02-03 | 2021-05-28 | 同济大学 | Crowd counting method and device based on picture self-symmetry crowd counting network |
CN112861697B (en) * | 2021-02-03 | 2022-10-25 | 同济大学 | Crowd counting method and device based on picture self-symmetry crowd counting network |
CN113191301B (en) * | 2021-05-14 | 2023-04-18 | 上海交通大学 | Video dense crowd counting method and system integrating time sequence and spatial information |
CN113191301A (en) * | 2021-05-14 | 2021-07-30 | 上海交通大学 | Video dense crowd counting method and system integrating time sequence and spatial information |
CN113343790A (en) * | 2021-05-21 | 2021-09-03 | 中车唐山机车车辆有限公司 | Traffic hub passenger flow statistical method, device and storage medium |
CN113920733A (en) * | 2021-10-14 | 2022-01-11 | 齐鲁工业大学 | Traffic volume estimation method and system based on deep network |
CN114120233B (en) * | 2021-11-29 | 2024-04-16 | 上海应用技术大学 | Training method of lightweight pyramid cavity convolution aggregation network for crowd counting |
CN114120233A (en) * | 2021-11-29 | 2022-03-01 | 上海应用技术大学 | Training method of lightweight pyramid hole convolution aggregation network for crowd counting |
CN114154620A (en) * | 2021-11-29 | 2022-03-08 | 上海应用技术大学 | Training method of crowd counting network |
CN114499941B (en) * | 2021-12-22 | 2023-08-04 | 天翼云科技有限公司 | Training and detecting method of flow detection model and electronic equipment |
CN114499941A (en) * | 2021-12-22 | 2022-05-13 | 天翼云科技有限公司 | Training and detecting method of flow detection model and electronic equipment |
CN114543312A (en) * | 2022-02-08 | 2022-05-27 | 珠海格力电器股份有限公司 | Fresh air equipment control method and device, computer equipment and medium |
CN117058627A (en) * | 2023-10-13 | 2023-11-14 | 阳光学院 | Public place crowd safety distance monitoring method, medium and system |
CN117058627B (en) * | 2023-10-13 | 2023-12-26 | 阳光学院 | Public place crowd safety distance monitoring method, medium and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111611878B (en) | Method for crowd counting and future people flow prediction based on video image | |
CN110781838B (en) | Multi-mode track prediction method for pedestrians in complex scene | |
CN108805083A (en) | The video behavior detection method of single phase | |
CN111476181B (en) | Human skeleton action recognition method | |
CN111563447B (en) | Crowd density analysis and detection positioning method based on density map | |
CN110852267B (en) | Crowd density estimation method and device based on optical flow fusion type deep neural network | |
CN110147743A (en) | Real-time online pedestrian analysis and number system and method under a kind of complex scene | |
CN111144329B (en) | Multi-label-based lightweight rapid crowd counting method | |
CN109858424A (en) | Crowd density statistical method, device, electronic equipment and storage medium | |
CN111191667B (en) | Crowd counting method based on multiscale generation countermeasure network | |
CN111783589B (en) | Complex scene crowd counting method based on scene classification and multi-scale feature fusion | |
CN106815563B (en) | Human body apparent structure-based crowd quantity prediction method | |
CN110059616A (en) | Pedestrian's weight identification model optimization method based on fusion loss function | |
CN112001278A (en) | Crowd counting model based on structured knowledge distillation and method thereof | |
CN110991317A (en) | Crowd counting method based on multi-scale perspective sensing type network | |
CN113139489A (en) | Crowd counting method and system based on background extraction and multi-scale fusion network | |
Wang et al. | Edge computing-enabled crowd density estimation based on lightweight convolutional neural network | |
CN113239904B (en) | High-resolution dense target counting method based on convolutional neural network | |
CN114187506A (en) | Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network | |
CN116170746B (en) | Ultra-wideband indoor positioning method based on depth attention mechanism and geometric information | |
CN112115786A (en) | Monocular vision odometer method based on attention U-net | |
CN113887536B (en) | Multi-stage efficient crowd density estimation method based on high-level semantic guidance | |
CN115965905A (en) | Crowd counting method and system based on multi-scale fusion convolutional network | |
CN115457464A (en) | Crowd counting method based on transformer and CNN | |
CN114445765A (en) | Crowd counting and density estimating method based on coding and decoding structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||