CN111611878A - Method for crowd counting and future people flow prediction based on video image - Google Patents


Info

Publication number
CN111611878A
Authority
CN
China
Prior art keywords
crowd
density
convlstm
density map
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010364590.1A
Other languages
Chinese (zh)
Other versions
CN111611878B (en)
Inventor
李小玉
翁立
赖晓平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202010364590.1A priority Critical patent/CN111611878B/en
Publication of CN111611878A publication Critical patent/CN111611878A/en
Application granted granted Critical
Publication of CN111611878B publication Critical patent/CN111611878B/en
Legal status: Active

Links

Images

Classifications

    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53: Recognition of crowd images, e.g. recognition of crowd congestion
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045: Combinations of networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a method for crowd counting and future people flow prediction based on video images. The invention comprises the following steps: 1. selecting a video image data set with annotation information and applying Gaussian-function processing according to the annotated head positions to generate a ground-truth density map; 2. inputting the video frames into the constructed MPDC model to extract feature maps and mapping the feature maps to crowd estimation density maps (DE); 3. stacking the obtained DE frames, inputting them into the constructed Bi-ConvLSTM network, predicting the crowd prediction density map at time T+1, and estimating the number of pedestrians at time T+1. The method adopts a multi-scale pyramid dilated convolution network and a Bi-ConvLSTM network based on residual connections, uses consecutive video frames to generate crowd estimation density maps, further predicts the crowd prediction density maps of future frames, and counts the number of people. The invention targets prediction over consecutive video images and is a novel method: it not only obtains a real-time crowd density map and head count, but also predicts the crowd density map and people flow of future frames.

Description

Method for crowd counting and future people flow prediction based on video image
Technical Field
The invention belongs to the field of crowd image processing in computer vision, and particularly relates to a method for crowd counting and future people flow prediction based on video images.
Background
Crowd counting means counting the number of people in a picture or a video sequence. Crowd counting and forecasting are important for public safety management, regional space planning, information resource acquisition and the like: they make it possible to better monitor and channel crowds in public places, and provide a basis for reasonable scheduling of personnel, route planning, crowd flow guidance, and building site selection.
Existing crowd counting methods can be divided into three categories: methods based on detection, on regression, and on density map estimation. Detection-based methods suit scenes with large, sparse targets. Regression of the head count, however, cannot estimate the congestion distribution and loses the spatial information of the targets' positions.
With the development of crowd counting technology, crowd counting algorithms have evolved from simply computing the number of pedestrians to producing an estimated crowd density map, so that the pedestrians can be counted while the density distribution of the crowd is obtained at the same time. Compared with the former two categories, methods based on deep-learning density map regression can, to a certain extent, effectively handle occlusion within the crowd, and the density map reveals the specific density distribution of pedestrians.
However, in complex backgrounds and in scenes where high-density crowds gather, interference factors such as target occlusion caused by heavy overlap between people, perspective distortion, scale variation, and uneven density distribution create great resistance to counting people and acquiring density information. Existing deep-learning methods for generating density maps use multi-column convolutional layers and large convolution kernels to extract multi-scale image features, which produces a large number of parameters and increases the difficulty of training the network. In addition, people flow prediction algorithms based on crowd video images have not been studied.
Disclosure of Invention
The invention aims to provide a method for crowd counting and future people flow prediction based on video images, addressing the problems in the existing crowd counting field.
The technical scheme of the invention generally comprises the following steps:
Step 1, selecting a video image data set with manual annotation information, and applying Gaussian-function processing according to the annotated head positions in the images to generate a ground-truth density map corresponding to each original image;
Step 2, first building a multi-scale pyramid dilated convolution network model (MPDC). Consecutive video frames of the video image data set are input into the MPDC model, feature maps with multi-scale information are fully extracted and mapped to a crowd estimation density map (DE), and the number of people is counted by integrating over the crowd estimation density map.
The multi-scale pyramid dilated convolution network model is divided into two parts: the first part is a VGG-Basic network; the second part consists of four parallel dilated convolution layers with different dilation rates; the feature map output by each branch is concatenated along the channel dimension with the feature map output by the VGG-Basic network.
Step 3, constructing a Bi-ConvLSTM network based on residual connections. Several frames of the obtained crowd estimation density maps (DE) are stacked and input into the Bi-ConvLSTM network to predict the crowd prediction density map at time T+1 and estimate the number of pedestrians at time T+1.
The bidirectional ConvLSTM module is an improvement on the conventional ConvLSTM: the input crowd estimation density maps (DE) are processed by two ConvLSTM units running in the forward and backward directions and superimposed, so the output contains both forward and backward sequence information.
Step 4, pre-training the multi-scale pyramid dilated convolution network model (MPDC) of step 2, saving the model parameters, storing the obtained crowd estimation density maps (DE), and inputting them into the residual-connected Bi-ConvLSTM network of step 3 for training. A stochastic gradient descent algorithm is adopted to optimize the parameters of the MPDC model and the Bi-ConvLSTM network, and the Euclidean distance is used to compute the loss between the crowd prediction density map and the ground-truth density map.
Preferably, the specific steps of step 1 are as follows:
and (3) converting the head position label in the input video image data set into a truth-value density graph by utilizing a two-dimensional Gaussian convolution kernel, and using the truth-value density graph as a training set with truth values for calculating loss difference.
In order to better correspond the truth density map to the dense crowd images at different viewing angles, the truth density map based on the geometric adaptive gaussian kernel is selected and represented by the following formula:
Figure BDA0002476139910000031
the truth density map is obtained by convolving the delta pulse function with a Gaussian function, and summing after convolution. x is the number ofiRepresenting the position of the ith individual's head in the image, i.e. the pixel coordinates of the ith individual's head in the image, (x-x)i) Pulse function representing the position of the head in an image, N1The total number of the human heads in the image;
Figure BDA0002476139910000032
Figure BDA0002476139910000033
is a distance x from the head positioniThe average distance of the nearest m head positions proves that β is 0.3 to be the best effect.
Figure BDA0002476139910000034
Representing the position x of the human headiTo the head position xjThe distance of (c).
The processing of step 1 converts the original image with head annotation in the video image data set into a true density map.
Preferably, the specific steps of step 2 are:
and generating an estimated crowd density map by continuous video images in the video image data set through a multi-scale pyramid cavity convolution network model.
The multi-scale pyramid cavity convolution network model is divided into two parts:
the first part is VGG-Basic, which takes a VGG-16 network as a Basic framework, only the first 10 convolutional layers and the first 3 maximum pooling layers are reserved, and the rest layers are completely removed;
the second part is composed of four parallel cavity convolution layers with different cavity rates, and four characteristic graphs of different receptive field multi-scale information are respectively generated; wherein the voidage is sequentially set to r-2lAnd l is 1,2,3 and 4, each hole convolutional layer has 5 convolutional layers, the size of a convolutional kernel is set to be C3, then the feature maps output by the four branches and the feature map output by the VGG-Basic are spliced on a channel, 1 × 1 convolutional layer is adopted to carry out feature dimension reduction and is mapped into a crowd estimation density map (DE), and the number of people is integrated and counted on the crowd estimation density map (DE).
Preferably, the specific steps of step 3 are:
for the density map prediction of consecutive video images, a Bi-ConvLSTM network based on residual concatenation is proposed. And (3) inputting the crowd estimated density map (DE) obtained in the step (2) into a Bi-ConvLSTM network based on residual connection for reconstruction and prediction, and inputting the crowd estimated density map sequence at the continuous time of { T-T., T-1, T } into the Bi-ConvLSTM network based on residual connection.
The Bi-ConvLSTM network based on residual connection uses ConvLSTM as a basic structure, the ConvLSTM is replaced by a bidirectional ConvLSTM structure, an input crowd estimation density graph is calculated by forward and reverse superposition of two ConvLSTM units, and an output characteristic graph comprises forward sequence information and reverse sequence information and is used for reconstructing a crowd estimation density graph sequence and predicting a future video frame sequence.
The spatio-temporal sequence prediction problem is to predict the most likely K video frame sequences in the future from the previous J training video frames,
Figure BDA0002476139910000041
wherein the previous frame
Figure BDA0002476139910000042
Most likely future frame Fm={Xt+1,Xt+2...,Xt+KPredicting future frames
Figure BDA0002476139910000043
t denotes the current time, J denotes the number of previous frames, K denotes the number of predicted frames, and σ () is a softmax function.
The Bi-ConvLSTM network is composed of bidirectional ConvLSTM, BN, ReLU activation functions and residual connection structures. And finally, obtaining a crowd prediction density map at the T +1 moment through convolution and ReLU function activation, and performing integral statistics on the pedestrian volume at the T +1 moment.
Preferably, the specific content of step 4 is:
training process: pre-training the multi-scale pyramid cavity convolution network Model (MPDC) in the step 2, storing model parameters, storing the obtained crowd estimation density map (DE) as the input of the step 3, and inputting the crowd estimation density map into a Bi-ConvLSTM network based on residual connection for training.
The loss between the crowd predicted density map and the truth density map is calculated by using Euclidean distance, and the invention adopts a random gradient descent algorithm to optimize parameters until the loss value converges to the predicted value.
The euclidean distance is used to measure the difference between the predicted and true density maps of the population. When the distance between the crowd prediction density graph and the truth value density graph is generated by adopting Euclidean distance measurement, the loss function is defined as follows:
Figure BDA0002476139910000051
wherein N represents the number of pictures input into the multi-scale pyramid hole convolution network model, and Z (X)i(ii) a Theta) is a crowd density estimation graph corresponding to the ith input picture, XiWhich represents the i-th input picture,
Figure BDA0002476139910000052
is shown asThe truth density map of i input picture pairs. Θ represents the network parameters to be learned.
When the crowd predicted density map is evaluated, the commonly used Mean Square Error (MSE) and Mean Absolute Error (MAE) are adopted, the MSE is used for describing the accuracy of the crowd predicted density map, the accuracy is higher when the MSE is smaller, and the MAE can reflect the error condition of the crowd predicted density map.
Figure BDA0002476139910000053
Figure BDA0002476139910000054
N represents the number of pictures input into the multi-scale pyramid hole convolution network model, CiThe predicted number of people in the crowd predicted density graph corresponding to the ith input picture is shown,
Figure BDA0002476139910000055
and the real number of people corresponding to the ith input picture is represented.
Testing process: a new data set of consecutive video frames is selected and input into the trained model for testing; the crowd prediction density maps are output and the counting results are computed.
The invention has the following beneficial effects:
the method adopts a Bi-ConvLSTM network based on multi-scale pyramid hole convolution network and residual connection, uses continuous video frames to generate an estimated crowd estimation density map (DE), predicts a crowd density map (FP) of a future frame, and further predicts the flow of people. The invention aims at predicting the density map and the pedestrian volume of the crowd in the video image target, and is a brand new method. The method comprises the steps that a cavity convolution network is selected from a multi-scale pyramid cavity convolution network to replace the traditional convolution-pooling-upsampling process, the receptive field is expanded while the precision is not lost, the target is accurately positioned, four groups of parallel cavity convolutions are adopted to form a pyramid mode, and the characteristics of an image are fully extracted by utilizing the receptive fields with different sizes; by fusing the outputs of different convolution layers, the learned features have more complete representation on the image; in a Bi-ConvLSTM network based on residual connection, a density map of a future frame is predicted by utilizing the strong space-time feature extraction capability of the ConvLSTM network, the ConvLSTM is replaced by the bidirectional ConvLSTM, the input density map is calculated by the forward and reverse superposition of two ConvLSTM units, and the output contains forward sequence information and reverse sequence information. Compared with the existing crowd counting technology, the method provided by the invention is used for counting the crowd of the video image, so that not only can a real-time crowd density map and the number of people be obtained, but also the crowd density map and the flow of people of a future frame can be predicted.
Drawings
FIG. 1 is the overall flow chart of the network of the present invention;
FIG. 2 is a network structure for generating a density map based on a multi-scale pyramid hole convolution network;
FIG. 3 is a network structure of a residual concatenation based Bi-ConvLSTM predicted future frame density map;
FIG. 4 is a network structure of a bidirectional ConvLSTM module according to the present invention;
FIG. 5 is a flowchart of the network model training process of the present invention.
Detailed Description
The invention provides a method for crowd counting and people flow prediction based on video images. 1) A VGG-Basic structure is selected for preliminary feature extraction; it is a convolutional neural network (CNN) built from several layers of small convolution kernels, which gives strong image feature representation while keeping the network's training parameters few. 2) Dilated convolutions are selected to replace the traditional convolution-pooling-upsampling pipeline, enlarging the receptive field and locating targets accurately without losing precision; four groups of parallel dilated convolution layers form a pyramid, and different receptive fields acquire the multi-scale information of the image. 3) By fusing the outputs of different convolutional layers, the learned features represent the image more completely. 4) The video-frame density map at future times is predicted by a bidirectional convolutional long short-term memory network (Bi-ConvLSTM) based on residual connections: t density maps are stacked as input according to the step length, the strong spatio-temporal feature extraction capability of the ConvLSTM network is used to predict the density map of the future frame, and the crowd count at the future time is predicted. Compared with existing crowd counting technology, the method counts crowds in video images and not only obtains a real-time crowd density map and head count, but also predicts the crowd density map and people flow of future frames.
As shown in FIG. 1: step 1, a video image data set with manual annotation information is selected, and Gaussian-function processing is applied according to the annotated head positions in the images to generate a ground-truth density map corresponding to each original image. Step 2, a multi-scale pyramid dilated convolution network model (MPDC) is built first; consecutive video frames of the video image data set are input into the MPDC model, feature maps with multi-scale information are fully extracted and mapped to a crowd estimation density map (DE), and the number of people is counted by integrating over the crowd estimation density map (DE). Step 3, a bidirectional ConvLSTM network based on residual connections (Bi-ConvLSTM network) is constructed; several frames of the obtained crowd estimation density maps (DE) are stacked and input into the Bi-ConvLSTM network to predict the crowd prediction density map at time T+1 and estimate the number of pedestrians at time T+1. Step 4, the MPDC model of step 2 is pre-trained, the model parameters are saved, and the obtained crowd estimation density maps (DE) are stored and input into the residual-connected Bi-ConvLSTM network of step 3 for training.
The method comprises the following specific steps:
the general steps of the step 1 are as follows:
and (3) converting the head position label in the input video image data set into a truth-value density graph by utilizing a two-dimensional Gaussian convolution kernel, and using the truth-value density graph as a training set with truth values for calculating loss difference. In order to make the truth density map better correspond to the dense crowd images at different viewing angles, the truth density map based on the geometric adaptive gaussian kernel is selected and represented by the following formula:
Figure BDA0002476139910000071
the truth density map is obtained by convolving the delta pulse function with a Gaussian function, and summing after convolution. x is the number ofiRepresenting the position of the ith individual's head in the image, i.e. the pixel coordinates of the ith individual's head in the image, (x-x)i) Pulse function representing the position of the head in an image, N1The total number of the human heads in the image;
Figure BDA0002476139910000081
Figure BDA0002476139910000082
is a distance x from the head positioniThe average distance of the nearest m head positions proves that β is 0.3 to be the best effect.
Figure BDA0002476139910000083
Representing the position x of the human headiTo the head position xjThe distance of (c).
The above operations convert the original image with the head label into a true density map, and train as a contrast training set of the convolutional neural network.
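For illustration, the following is a minimal NumPy/SciPy sketch of the geometry-adaptive kernel described above. It assumes annotations are (x, y) pixel coordinates; the function name, the default m = 3, and the fallback σ for a single head are illustrative choices, not prescribed by the patent.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import KDTree

def geometry_adaptive_density_map(shape, head_points, beta=0.3, m=3):
    """Build a ground-truth density map from head annotations.

    shape       -- (H, W) of the image
    head_points -- list of (x, y) pixel coordinates, one per head
    beta, m     -- kernel scale factor (0.3 per the text) and neighbour count
    """
    density = np.zeros(shape, dtype=np.float32)
    if len(head_points) == 0:
        return density
    tree = KDTree(head_points)
    # The nearest neighbour of each point is itself, so query m + 1 neighbours.
    distances, _ = tree.query(head_points, k=min(m + 1, len(head_points)))
    for i, (x, y) in enumerate(head_points):
        col = min(int(x), shape[1] - 1)
        row = min(int(y), shape[0] - 1)
        impulse = np.zeros(shape, dtype=np.float32)
        impulse[row, col] = 1.0  # delta impulse at the head position
        if len(head_points) > 1:
            sigma = beta * distances[i][1:].mean()  # beta * mean distance to m nearest heads
        else:
            sigma = np.mean(shape) / 4.0            # assumed fallback for a single head
        # Convolve the impulse with a Gaussian of width sigma and accumulate;
        # the sum over the final map approximates the head count N.
        density += gaussian_filter(impulse, sigma)
    return density
```

Filtering one impulse image per head is slow but keeps the correspondence with the formula explicit; a production version would stamp local Gaussian windows instead.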
The detailed steps of step 2 are as follows:
As shown in FIG. 2, consecutive video images in the video image data set pass through the multi-scale pyramid dilated convolution network model (MPDC) to generate estimated crowd density maps. The MPDC model is divided into two parts. The first part is VGG-Basic, which takes the VGG-16 network as its backbone and keeps only the first 10 convolutional layers and the first 3 max-pooling layers, removing all remaining layers; the convolution kernels are all 3×3, the channel numbers are set in sequence to 64, 128, 256, 512 and 512, and the pooling size is 2×2. The second part consists of four groups of parallel dilated convolutions with different dilation rates, producing four feature maps that carry multi-scale information from different receptive fields; the dilation rates are set in sequence to r = 2^l (l = 1, 2, 3, 4), each dilated branch contains 5 convolutional layers with kernel size 3×3 and channel numbers set in sequence to 512, 256 and 128. The four groups of output feature maps are then concatenated along the channel dimension with the feature map output by VGG-Basic, and a 1×1 convolutional layer performs feature dimension reduction and maps the result to a crowd estimation density map (DE), over which the real-time head count is obtained by integration.
The detailed steps of step 3 are as follows:
As shown in FIG. 3, a Bi-ConvLSTM network based on residual connections is proposed for density map prediction over consecutive video images. The crowd estimation density maps (DE) obtained in step 2 are input into the residual-connected Bi-ConvLSTM network for reconstruction and prediction: the sequence of crowd estimation density maps at the consecutive times {T−t, ..., T−1, T} is input into the network. The network uses ConvLSTM as its basic structure, replacing ConvLSTM with a bidirectional ConvLSTM structure: the input crowd estimation density maps are processed by two ConvLSTM units running in the forward and backward directions and superimposed, so the output contains both forward and backward sequence information; this is used to reconstruct the density map sequence and to predict future video frames. For example, in data processing, 9 frames are input; the 5th frame is predicted forward from frames 1-4 and backward from frames 9-6, and the two predictions are combined to obtain the final predicted 5th frame. The spatio-temporal sequence prediction problem is to predict the most likely K future video frames from the previous J training frames:

\tilde{F}_m = \arg\max_{F_m} p(F_m \mid F_p)

where the previous frames are F_p = {X_{t-J+1}, ..., X_t} and the most likely future frames are F_m = {X_{t+1}, X_{t+2}, ..., X_{t+K}}; t denotes the current time, J the number of previous frames, K the number of predicted frames, and σ(·) is the softmax function.
The Bi-ConvLSTM network is composed of bidirectional ConvLSTM layers, BN, ReLU activation functions, and residual connection structures. Finally, convolution followed by ReLU activation yields the crowd prediction density map (FP) at time T+1, over which the people flow at time T+1 is obtained by integration.
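A minimal PyTorch sketch of the bidirectional ConvLSTM computation follows. The gate equations are the standard ConvLSTM ones; superimposing the two directions by elementwise addition, the hidden width, and the omission of the BN/ReLU/residual wrapper and the final conv head are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One ConvLSTM cell: the four LSTM gates computed by a convolution."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class BiConvLSTM(nn.Module):
    """Runs one ConvLSTM forward and one backward over a density-map
    sequence and superimposes the two outputs."""
    def __init__(self, in_ch=1, hid_ch=32):
        super().__init__()
        self.fwd = ConvLSTMCell(in_ch, hid_ch)
        self.bwd = ConvLSTMCell(in_ch, hid_ch)

    def run(self, cell, seq):
        b, _, _, hgt, wid = seq.shape          # seq: (B, T, C, H, W)
        h = seq.new_zeros(b, cell.hid_ch, hgt, wid)
        c = torch.zeros_like(h)
        out = []
        for t in range(seq.size(1)):
            h, c = cell(seq[:, t], (h, c))
            out.append(h)
        return torch.stack(out, dim=1)

    def forward(self, seq):
        fwd = self.run(self.fwd, seq)
        bwd = self.run(self.bwd, seq.flip(1)).flip(1)  # backward pass, re-aligned
        return fwd + bwd                               # superimpose both directions
```

In the full network described in the text, such modules would be stacked with BN, ReLU, and residual connections, with a final convolution mapping the hidden features back to a 1-channel density map.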
The detailed content of step 4 is as follows:
As shown in FIG. 5, the training process: the multi-scale pyramid dilated convolution network model (MPDC) of step 2 is pre-trained and its parameters are saved; the resulting crowd estimation density maps (DE) are stored as the input of step 3 and fed into the residual-connected Bi-ConvLSTM network for training. The loss between the crowd prediction density map (FP) and the ground-truth density map (GT) is computed with the Euclidean distance, and the invention adopts a stochastic gradient descent algorithm to optimize the parameters until the loss value converges.
The Euclidean distance measures the difference between the crowd prediction density map and the ground-truth density map. With the Euclidean distance measuring the distance between the crowd prediction density map and the ground-truth density map, the loss function is defined as:

L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \| Z(X_i; \Theta) - Z_i^{GT} \|_2^2

where N is the number of pictures input to the multi-scale pyramid dilated convolution network model, X_i is the i-th input picture, Z(X_i; \Theta) is the crowd density estimation map for the i-th input picture, Z_i^{GT} is the ground-truth density map for the i-th input picture, and \Theta denotes the network parameters to be learned.
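A hedged sketch of this training step in PyTorch: `mpdc`, `bi_convlstm`, and `loader` are hypothetical stand-ins for the two networks and the data pipeline, the learning rate is illustrative, and the Bi-ConvLSTM is assumed to emit one density map per time step, the last of which is taken as the T+1 prediction.

```python
import torch
import torch.nn as nn

def train_step(mpdc, bi_convlstm, loader, lr=1e-6):
    params = list(mpdc.parameters()) + list(bi_convlstm.parameters())
    optimizer = torch.optim.SGD(params, lr=lr)        # stochastic gradient descent
    euclidean = nn.MSELoss(reduction="sum")           # summed squared L2 distance
    for frames, gt_density in loader:                 # frames: (B, T, 3, H, W)
        # Estimated density map for each frame, stacked along time
        de = torch.stack([mpdc(f) for f in frames.unbind(1)], dim=1)
        fp = bi_convlstm(de)[:, -1]                   # predicted map at time T+1
        loss = 0.5 * euclidean(fp, gt_density) / frames.size(0)  # L(Theta) above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```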
The crowd prediction density maps are evaluated with the commonly used mean squared error (MSE) and mean absolute error (MAE); the MSE describes the accuracy of the crowd prediction density map (the smaller the MSE, the higher the accuracy), while the MAE reflects its error:

MAE = \frac{1}{N} \sum_{i=1}^{N} | C_i - C_i^{GT} |

MSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} ( C_i - C_i^{GT} )^2 }

where N is the number of pictures input to the multi-scale pyramid dilated convolution network model, C_i is the predicted count from the crowd prediction density map for the i-th input picture, and C_i^{GT} is the true count for the i-th input picture.
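Both metrics operate on per-picture counts (integrals of the density maps); a short sketch, using the square-root form of the MSE given above:

```python
import numpy as np

def mae_mse(pred_counts, true_counts):
    """MAE and MSE over predicted vs. true crowd counts per picture."""
    pred = np.asarray(pred_counts, dtype=np.float64)
    true = np.asarray(true_counts, dtype=np.float64)
    mae = np.abs(pred - true).mean()
    mse = np.sqrt(((pred - true) ** 2).mean())  # root form, as in the formula above
    return mae, mse
```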
Testing process: a new data set of consecutive video frames is selected and input into the trained model for testing; the crowd prediction density maps are output and the counting results are computed.

Claims (5)

1. A method for crowd counting and future people flow prediction based on video images, characterized by comprising the following steps:
step 1, selecting a video image data set with manual annotation information, and applying Gaussian-function processing according to the annotated head positions in the images to generate a ground-truth density map corresponding to each original image;
step 2, building a multi-scale pyramid dilated convolution network model, inputting consecutive video frames of the video image data set into the model, fully extracting feature maps with multi-scale information, mapping the feature maps to a crowd estimation density map, and integrating over the crowd estimation density map to count the number of people;
the multi-scale pyramid dilated convolution network model is divided into two parts: the first part is a VGG-Basic network; the second part consists of four parallel dilated convolution layers with different dilation rates; the feature map output by each branch is concatenated along the channel dimension with the feature map output by the VGG-Basic network;
step 3, constructing a Bi-ConvLSTM network based on residual connections; stacking the obtained crowd estimation density maps and inputting them into the Bi-ConvLSTM network to predict the crowd prediction density map at time T+1 and estimate the number of pedestrians at time T+1;
the bidirectional ConvLSTM module is an improvement on the conventional ConvLSTM: the input crowd estimation density maps are processed by two ConvLSTM units running in the forward and backward directions and superimposed, so the output contains both forward and backward sequence information;
step 4, pre-training the multi-scale pyramid dilated convolution network model of step 2, saving the model parameters, storing the obtained crowd estimation density maps, and inputting them into the residual-connected Bi-ConvLSTM network of step 3 for training; optimizing the parameters of the multi-scale pyramid dilated convolution network model and the Bi-ConvLSTM network with a stochastic gradient descent algorithm, and computing the loss between the crowd prediction density map and the ground-truth density map with the Euclidean distance.
2. The method of claim 1, wherein step 1 comprises the following steps:
converting the head-position annotations in the input video image data set into a ground-truth density map with a two-dimensional Gaussian convolution kernel, and using it as the ground truth of the training set for computing the loss;
so that the ground-truth density map corresponds well to dense crowd images at different viewing angles, a ground-truth density map based on a geometry-adaptive Gaussian kernel is selected, expressed by the following formula:

F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \quad \sigma_i = \beta \bar{d}_i

that is, the ground-truth density map is obtained by convolving a delta impulse function with a Gaussian kernel and then summing; x_i denotes the position of the i-th head in the image, i.e. the pixel coordinates of the i-th head, \delta(x - x_i) is the impulse function for the head position in the image, and N is the total number of heads in the image;

\bar{d}_i = \frac{1}{m} \sum_{j=1}^{m} d_i^j

is the average distance from head position x_i to its m nearest head positions, and d_i^j denotes the distance from head position x_i to head position x_j; experiments show that \beta = 0.3 gives the best results.
3. The method for crowd counting and future people flow prediction based on video images as claimed in claim 1 or 2, wherein the specific steps of step 2 are:
generating estimated crowd density maps from consecutive video images of the video image data set through the multi-scale pyramid dilated convolution network model;
the multi-scale pyramid dilated convolution network model is divided into two parts:
the first part is VGG-Basic, which takes the VGG-16 network as its backbone, keeps only the first 10 convolutional layers and the first 3 max-pooling layers, and removes all remaining layers;
the second part consists of four parallel dilated convolution layers with different dilation rates, producing four feature maps with multi-scale information from different receptive fields; the dilation rates are set in sequence to r = 2^l, l = 1, 2, 3, 4, each dilated branch contains 5 convolutional layers, and the kernel size is set to 3×3; the feature maps output by the four branches and the feature map output by VGG-Basic are then concatenated along the channel dimension, a 1×1 convolutional layer performs feature dimension reduction and maps the result to a crowd estimation density map (DE), and the count is obtained by integrating over the crowd estimation density map (DE).
4. The method of claim 3, wherein step 3 comprises the following steps:
for density map prediction over consecutive video images, a Bi-ConvLSTM network based on residual connections is proposed; the crowd estimation density maps obtained in step 2 are input into the residual-connected Bi-ConvLSTM network for reconstruction and prediction, the sequence of crowd estimation density maps at the consecutive times {T−t, ..., T−1, T} being input into the network;
the residual-connected Bi-ConvLSTM network uses ConvLSTM as its basic structure, replacing ConvLSTM with a bidirectional ConvLSTM structure; the input crowd estimation density maps are processed by two ConvLSTM units running in the forward and backward directions and superimposed, and the output feature maps contain both forward and backward sequence information, used to reconstruct the crowd estimation density map sequence and to predict the future video frame sequence;
the spatio-temporal sequence prediction problem is to predict the most likely K future video frames from the previous J training video frames:

\tilde{F}_m = \arg\max_{F_m} p(F_m \mid F_p)

where the previous frames are F_p = {X_{t-J+1}, ..., X_t}, the most likely future frames are F_m = {X_{t+1}, X_{t+2}, ..., X_{t+K}}, t denotes the current time, J the number of previous frames, K the number of predicted frames, and σ(·) is the softmax function;
the Bi-ConvLSTM network consists of bidirectional ConvLSTM, BN, ReLU activation functions, and a residual connection structure; finally, convolution followed by ReLU activation yields the crowd prediction density map at time T+1, over which the pedestrian flow at time T+1 is obtained by integration.
5. The method according to claim 3 or 4, wherein step 4 comprises:
training process: pre-training the multi-scale pyramid dilated convolution network model of step 2, saving the model parameters, storing the obtained crowd estimation density maps as the input of step 3, and inputting them into the residual-connected Bi-ConvLSTM network for training;
computing the loss between the crowd prediction density map and the ground-truth density map with the Euclidean distance, and optimizing the parameters with a stochastic gradient descent algorithm until the loss value converges; with the Euclidean distance measuring the distance between the crowd prediction density map and the ground-truth density map, the loss function is defined as:

L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \| Z(X_i; \Theta) - Z_i^{GT} \|_2^2

where N is the number of pictures input to the multi-scale pyramid dilated convolution network model, X_i is the i-th input picture, Z(X_i; \Theta) is the crowd density estimation map for the i-th input picture, Z_i^{GT} is the ground-truth density map for the i-th input picture, and \Theta denotes the network parameters to be learned;
when evaluating the crowd prediction density map, the commonly used mean squared error MSE and mean absolute error MAE are adopted, specifically:

MAE = \frac{1}{N} \sum_{i=1}^{N} | C_i - C_i^{GT} |

MSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} ( C_i - C_i^{GT} )^2 }

where N is the number of pictures input to the multi-scale pyramid dilated convolution network model, C_i is the predicted count from the crowd prediction density map for the i-th input picture, and C_i^{GT} is the true count for the i-th input picture;
testing process: a new data set of consecutive video frames is selected and input into the trained model for testing, the crowd prediction density maps are output, and the counting results are computed.
CN202010364590.1A 2020-04-30 2020-04-30 Method for crowd counting and future people flow prediction based on video image Active CN111611878B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010364590.1A CN111611878B (en) 2020-04-30 2020-04-30 Method for crowd counting and future people flow prediction based on video image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010364590.1A CN111611878B (en) 2020-04-30 2020-04-30 Method for crowd counting and future people flow prediction based on video image

Publications (2)

Publication Number Publication Date
CN111611878A true CN111611878A (en) 2020-09-01
CN111611878B CN111611878B (en) 2022-07-22

Family

ID=72203064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010364590.1A Active CN111611878B (en) 2020-04-30 2020-04-30 Method for crowd counting and future people flow prediction based on video image

Country Status (1)

Country Link
CN (1) CN111611878B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200118423A1 (en) * 2017-04-05 2020-04-16 Carnegie Mellon University Deep Learning Methods For Estimating Density and/or Flow of Objects, and Related Methods and Software
CN107862261A (en) * 2017-10-25 2018-03-30 天津大学 Image people counting method based on multiple dimensioned convolutional neural networks
CN108388852A (en) * 2018-02-09 2018-08-10 北京天元创新科技有限公司 A kind of region crowd density prediction technique and device based on deep learning
US20190347476A1 (en) * 2018-05-09 2019-11-14 Korea Advanced Institute Of Science And Technology Method for estimating human emotions using deep psychological affect network and system therefor
CN108615027A (en) * 2018-05-11 2018-10-02 常州大学 A method of video crowd is counted based on shot and long term memory-Weighted Neural Network
CN109558862A (en) * 2018-06-15 2019-04-02 广州深域信息科技有限公司 The people counting method and system of attention refinement frame based on spatial perception
CN109101930A (en) * 2018-08-18 2018-12-28 华中科技大学 A kind of people counting method and system
CN109460855A (en) * 2018-09-29 2019-03-12 中山大学 A kind of throughput of crowded groups prediction model and method based on focus mechanism
CN109815867A (en) * 2019-01-14 2019-05-28 东华大学 A kind of crowd density estimation and people flow rate statistical method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FENG XIONG 等: ""Spatiotemporal Modeling for Crowd Counting in Videos"", 《ARXIV》 *
SHANGHANG ZHANG 等: ""FCN-rLSTM: Deep Spatio-Temporal Neural Networks for Vehicle Counting in City Cameras"", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)》 *
YANYAN FANG 等: ""LOCALITY-CONSTRAINED SPATIAL TRANSFORMER NETWORK FOR VIDEO CROWD COUNTING"", 《ARXIV》 *
刘旭 (LIU Xu): "Research on Object Counting Methods in Video Surveillance", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215129A (en) * 2020-10-10 2021-01-12 江南大学 Crowd counting method and system based on sequencing loss and double-branch network
US20220138475A1 (en) * 2020-11-04 2022-05-05 Tahmid Z CHOWDHURY Methods and systems for crowd motion summarization via tracklet based human localization
US11348338B2 (en) * 2020-11-04 2022-05-31 Huawei Technologies Co., Ltd. Methods and systems for crowd motion summarization via tracklet based human localization
CN112380960A (en) * 2020-11-11 2021-02-19 广东电力信息科技有限公司 Crowd counting method, device, equipment and storage medium
WO2022106556A1 (en) * 2020-11-18 2022-05-27 Thales Method for determining a density of elements in regions of an environment, and associated computer program product
FR3116361A1 (en) * 2020-11-18 2022-05-20 Thales Method for determining a density of elements in areas of an environment, associated computer program product
CN112418120A (en) * 2020-11-27 2021-02-26 湖南师范大学 Crowd detection method based on peak confidence map
CN112418120B (en) * 2020-11-27 2021-09-28 湖南师范大学 Crowd detection method based on peak confidence map
CN112541891A (en) * 2020-12-08 2021-03-23 山东师范大学 Crowd counting method and system based on void convolution high-resolution network
CN112633106A (en) * 2020-12-16 2021-04-09 苏州玖合智能科技有限公司 Crowd characteristic recognition network construction and training method suitable for large depth of field
CN112767451A (en) * 2021-02-01 2021-05-07 福州大学 Crowd distribution prediction method and system based on double-current convolutional neural network
CN112767451B (en) * 2021-02-01 2022-09-06 福州大学 Crowd distribution prediction method and system based on double-current convolutional neural network
CN112861697A (en) * 2021-02-03 2021-05-28 同济大学 Crowd counting method and device based on picture self-symmetry crowd counting network
CN112861697B (en) * 2021-02-03 2022-10-25 同济大学 Crowd counting method and device based on picture self-symmetry crowd counting network
CN113191301B (en) * 2021-05-14 2023-04-18 上海交通大学 Video dense crowd counting method and system integrating time sequence and spatial information
CN113191301A (en) * 2021-05-14 2021-07-30 上海交通大学 Video dense crowd counting method and system integrating time sequence and spatial information
CN113343790A (en) * 2021-05-21 2021-09-03 中车唐山机车车辆有限公司 Traffic hub passenger flow statistical method, device and storage medium
CN113920733A (en) * 2021-10-14 2022-01-11 齐鲁工业大学 Traffic volume estimation method and system based on deep network
CN114120233B (en) * 2021-11-29 2024-04-16 上海应用技术大学 Training method of lightweight pyramid cavity convolution aggregation network for crowd counting
CN114120233A (en) * 2021-11-29 2022-03-01 上海应用技术大学 Training method of lightweight pyramid hole convolution aggregation network for crowd counting
CN114154620A (en) * 2021-11-29 2022-03-08 上海应用技术大学 Training method of crowd counting network
CN114499941B (en) * 2021-12-22 2023-08-04 天翼云科技有限公司 Training and detecting method of flow detection model and electronic equipment
CN114499941A (en) * 2021-12-22 2022-05-13 天翼云科技有限公司 Training and detecting method of flow detection model and electronic equipment
CN114543312A (en) * 2022-02-08 2022-05-27 珠海格力电器股份有限公司 Fresh air equipment control method and device, computer equipment and medium
CN117058627A (en) * 2023-10-13 2023-11-14 阳光学院 Public place crowd safety distance monitoring method, medium and system
CN117058627B (en) * 2023-10-13 2023-12-26 阳光学院 Public place crowd safety distance monitoring method, medium and system

Also Published As

Publication number Publication date
CN111611878B (en) 2022-07-22

Similar Documents

Publication Publication Date Title
CN111611878B (en) Method for crowd counting and future people flow prediction based on video image
CN110781838B (en) Multi-mode track prediction method for pedestrians in complex scene
CN108805083A (en) The video behavior detection method of single phase
CN111476181B (en) Human skeleton action recognition method
CN111563447B (en) Crowd density analysis and detection positioning method based on density map
CN110852267B (en) Crowd density estimation method and device based on optical flow fusion type deep neural network
CN110147743A (en) Real-time online pedestrian analysis and number system and method under a kind of complex scene
CN111144329B (en) Multi-label-based lightweight rapid crowd counting method
CN109858424A (en) Crowd density statistical method, device, electronic equipment and storage medium
CN111191667B (en) Crowd counting method based on multiscale generation countermeasure network
CN111783589B (en) Complex scene crowd counting method based on scene classification and multi-scale feature fusion
CN106815563B (en) Human body apparent structure-based crowd quantity prediction method
CN110059616A (en) Pedestrian's weight identification model optimization method based on fusion loss function
CN112001278A (en) Crowd counting model based on structured knowledge distillation and method thereof
CN110991317A (en) Crowd counting method based on multi-scale perspective sensing type network
CN113139489A (en) Crowd counting method and system based on background extraction and multi-scale fusion network
Wang et al. Edge computing-enabled crowd density estimation based on lightweight convolutional neural network
CN113239904B (en) High-resolution dense target counting method based on convolutional neural network
CN114187506A (en) Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network
CN116170746B (en) Ultra-wideband indoor positioning method based on depth attention mechanism and geometric information
CN112115786A (en) Monocular vision odometer method based on attention U-net
CN113887536B (en) Multi-stage efficient crowd density estimation method based on high-level semantic guidance
CN115965905A (en) Crowd counting method and system based on multi-scale fusion convolutional network
CN115457464A (en) Crowd counting method based on transformer and CNN
CN114445765A (en) Crowd counting and density estimating method based on coding and decoding structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant