CN111191667A - Crowd counting method for generating confrontation network based on multiple scales - Google Patents
Crowd counting method for generating confrontation network based on multiple scales
- Publication number
- CN111191667A CN111191667A CN201811356818.1A CN201811356818A CN111191667A CN 111191667 A CN111191667 A CN 111191667A CN 201811356818 A CN201811356818 A CN 201811356818A CN 111191667 A CN111191667 A CN 111191667A
- Authority
- CN
- China
- Prior art keywords
- density map
- crowd
- density
- network
- discriminator
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 22
- 238000012549 training Methods 0.000 claims abstract description 30
- 238000013527 convolutional neural network Methods 0.000 claims abstract description 9
- 230000000694 effects Effects 0.000 claims abstract description 5
- 230000006870 function Effects 0.000 claims description 25
- 238000002474 experimental method Methods 0.000 claims description 3
- 230000003042 antagonistic effect Effects 0.000 claims description 2
- 239000000654 additive Substances 0.000 claims 1
- 230000000996 additive effect Effects 0.000 claims 1
- 238000001514 detection method Methods 0.000 abstract description 10
- 238000012544 monitoring process Methods 0.000 description 5
- 238000010586 diagram Methods 0.000 description 4
- 238000011176 pooling Methods 0.000 description 3
- 230000003044 adaptive effect Effects 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 2
- 238000013528 artificial neural network Methods 0.000 description 2
- 238000013461 design Methods 0.000 description 2
- 230000004927 fusion Effects 0.000 description 2
- 238000013135 deep learning Methods 0.000 description 1
- 238000005315 distribution function Methods 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 230000005484 gravity Effects 0.000 description 1
- 230000000977 initiatory effect Effects 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 230000002265 prevention Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
A crowd counting method based on a multi-scale generative adversarial network predicts crowd density through adversarial training. The min-max problem between the generative model and the discriminative model is optimized by joint, alternating iteration: the generator network is trained to produce accurate crowd density maps that fool the discriminator, while the discriminator is trained to distinguish generated density maps from ground-truth density map labels. At the same time, the discriminator's output gives the generator feedback on the localization and prediction accuracy of the density map. The two networks are trained in competition, improving the generated results until the discriminator can no longer correctly judge the samples produced by the generator. By introducing this adversarial loss, the proposed crowd density estimation algorithm drives the convolutional neural network to generate higher-quality density maps, thereby improving the accuracy of crowd counting.
Description
Technical Field
The invention relates to the fields of image processing and computer vision, and in particular to a crowd counting algorithm based on a multi-scale generative adversarial network.
Background
With China's growing population, large-scale crowd gatherings are increasingly common. Video surveillance is currently the main means of controlling the number of people in public places and preventing accidents caused by crowd-density overload. Within the field of video surveillance and security, crowd analysis has attracted growing attention from researchers and has become one of the most active research topics in computer vision. The crowd counting task is to accurately estimate the total number of people in an image while also giving the spatial distribution of crowd density. Image-based crowd counting can be used in many areas, such as accident prevention, space planning, analysis of consumer habits, and traffic scheduling.
At present, the mainstream crowd counting algorithms used in intelligent surveillance fall into two main categories: detection-based and regression-based. Detection-based methods aim to accurately detect and localize every pedestrian in each frame of a surveillance video using a hand-designed visual object detector, and obtain the estimated count by accumulating all detected targets. As early as 1998, Papageorgiou et al. proposed training SVM classifiers on wavelet features extracted at different scales in the image for the pedestrian detection task. In 2001, Lin et al. proposed an improvement: first applying histogram equalization and a Haar wavelet transform to the image, then extracting multi-scale statistical features of head contours, and finally training an SVM detector. That algorithm obtains fairly accurate crowd detection counts when the video is sharp, but is strongly affected by environmental changes and the viewing angle of the surveillance camera. In 2005, Dalal et al. proposed a pedestrian detection algorithm based on Histogram of Oriented Gradients (HOG) features, combined with a linear SVM to classify, detect, and count the people in an image, further improving pedestrian detection accuracy.
However, when the crowd density in the monitored scene is high, occlusion within the crowd prevents detector-based counting algorithms from accurately detecting and tracking most pedestrians.
Disclosure of Invention
In order to solve the problems in the prior art, the proposed crowd counting method based on a multi-scale generative adversarial network fuses features from different depths of a single-column convolutional neural network to handle scale variation, occlusion, and related problems in crowd images; it also adds the adversarial loss of a discriminator to the network model, predicts crowd density through adversarial training, and generates higher-quality density maps.
The crowd counting method based on the multi-scale generative adversarial network comprises the following specific steps:
1. Gaussian kernel density map of the crowd scene
The invention converts the given head-coordinate annotations into a crowd density distribution map. For a crowd image in the dataset with annotated head coordinates, each annotation can be represented by a discrete delta function at the corresponding head position $x_i$, so the positions of the $N$ heads in each image are labeled as:

$$H(x) = \sum_{i=1}^{N} \delta(x - x_i)$$

To convert this discrete head-position function into a continuous density function, it is convolved with a Gaussian kernel $G_\sigma$, giving the density equation:

$$F(x) = H(x) * G_\sigma(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_\sigma(x)$$
2. Building the multi-scale generative adversarial network
Crowd counting methods based on deep convolutional neural networks still predict density maps of unsatisfactory quality in complex, high-density crowd scenes. The main reason is that in such scenes pedestrians and background are highly similar, so convolutional network methods suffer from mis-detection and mis-classification. Meanwhile, the quality of the predicted density map strongly affects counting accuracy. The invention therefore proposes a crowd counting method based on a Multi-Scale Generative Adversarial Network (MS-GAN), introducing an adversarial objective to improve prediction accuracy.
The structure of the multi-scale generative adversarial network model is shown in Fig. 1 and consists of two main parts: a generator and a discriminator. The generator is a multi-scale convolutional neural network that takes a crowd image as input and outputs a predicted crowd density map. The obtained density map is then stacked with the crowd image and fed to the discriminator, which is trained to judge whether its input contains a generated density map or the ground-truth density map. Because the crowd image is stacked into the input, the discriminator must also judge whether the density map matches the crowd image.
3. Design of the content loss function
In the proposed network model, the generator learns a mapping from a crowd image to its corresponding crowd density map, and the output of the network model is the predicted density map. A pixel-level loss function is adopted: the Euclidean distance between the predicted density map and the ground-truth density map, i.e. the pixel-wise mean squared error (MSE):

$$L_E(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| G(X_i; \theta) - Y_i \right\|_2^2$$

where $G(X_i; \theta)$ is the density map generated by the generator, $\theta$ denotes the parameters of the generator network, $X_i$ is the $i$-th crowd image, $Y_i$ is its ground-truth label density map, and $N$ is the number of training images.
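A minimal NumPy sketch of this pixel-level content loss; the array shapes and values are illustrative only:

```python
import numpy as np

def euclidean_loss(pred_maps, gt_maps):
    """Pixel-level content loss: squared Euclidean distance between each
    predicted density map G(X_i; theta) and its ground truth Y_i, summed
    over pixels and averaged over the N training images with the 1/(2N)
    scaling used in the text."""
    n = pred_maps.shape[0]
    diff = pred_maps.reshape(n, -1) - gt_maps.reshape(n, -1)
    return float(np.sum(diff ** 2) / (2.0 * n))

# Two 4x4 "maps" that differ by 1 at every pixel:
pred = np.zeros((2, 4, 4))
gt = np.ones((2, 4, 4))
print(euclidean_loss(pred, gt))   # 16 unit errors per map -> 32 / (2*2) = 8.0
```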
4. Design of the adversarial loss function
The purpose of the discriminator is to distinguish generated density maps from ground-truth label density maps. Accordingly, generated density maps are labeled 0 and ground-truth density map labels are labeled 1, and the discriminator's output represents the probability that the input density map is real. An additional adversarial loss is used in the method to improve the quality of the generated density maps. The adversarial loss (Adversarial Loss) for the generator is expressed as:

$$L_A(\theta) = -\log D\!\left(X_i, G(X_i; \theta)\right)$$

where $D(X_i, G(X_i; \theta))$ expresses how well the predicted density map matches the corresponding crowd image. The discriminator's input is a tensor formed by stacking the crowd image $X_i$ with either the generated density map $G(X_i; \theta)$ or the ground-truth label $Y_i$ along a third (channel) dimension. Finally, the loss function for the generator is a weighted sum of the mean squared error and the adversarial loss:

$$L(\theta) = L_E(\theta) + \lambda L_A(\theta)$$

Based on extensive experiments, the weight $\lambda$ is set to balance the relative contribution of the two loss values. In practice, combining the two loss functions makes the training of the network more stable and the density map prediction more accurate.
5. Joint adversarial training
The crowd density prediction model based on the adversarial network differs from the original purpose of generative adversarial networks: the goal of crowd density estimation is to generate an accurate density map, not a realistic natural image. Thus, the input of the proposed crowd density estimation model is no longer random noise drawn from a prior distribution but the crowd image itself. Secondly, because the crowd image contains the distribution information of the crowd scene, the proposed model uses the crowd image as conditioning information for the density map, and the density map and the crowd image are input to the discriminator together. In the actual training process a conditional adversarial network model is adopted, and the purpose of joint training is to estimate a high-quality crowd density map. The joint training objective of the generator and the discriminator is:

$$\min_G \max_D \; \mathbb{E}_{X,Y}\!\left[\log D(X, Y)\right] + \mathbb{E}_{X}\!\left[\log\!\left(1 - D(X, G(X))\right)\right]$$

where $G$ denotes the generator network, which takes a crowd image $X$ as input and outputs a predicted crowd density map, and $D$ denotes the discriminator, whose output is the probability that the input density map is real. The purpose of the discriminator is to distinguish the density map $G(X)$ generated by the generator from the ground-truth label density map $Y$, while the generator is trained to produce high-quality density maps that the discriminator cannot discern.
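The alternating min-max training can be illustrated on a deliberately tiny 1-D stand-in, where a scalar "generator" learns a mapping under both a content loss and adversarial feedback from a logistic "discriminator". Everything here (the linear mapping, the learning rates, the λ value) is a toy substitute for the actual convolutional networks, chosen only to show the alternating update pattern:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy setup: the "generator" y = a*x should learn the true mapping y = 2*x;
# the "discriminator" scores (x, y) pairs with a logistic on y.
xs = np.linspace(0.1, 1.0, 10)
true_y = 2.0 * xs
a = 0.0                    # generator parameter
u, b = 0.0, 0.0            # discriminator parameters
lam, lr = 0.01, 0.05

def mse(a):
    return float(np.mean((a * xs - true_y) ** 2))

start = mse(a)
for _ in range(300):
    # --- discriminator step: push real pairs toward 1, generated toward 0 ---
    fake_y = a * xs
    d_real = sigmoid(u * true_y + b)
    d_fake = sigmoid(u * fake_y + b)
    u += lr * np.mean((1 - d_real) * true_y - d_fake * fake_y)
    b += lr * np.mean((1 - d_real) - d_fake)
    # --- generator step: content loss plus adversarial feedback ---
    d_fake = sigmoid(u * (a * xs) + b)
    grad_mse = np.mean(2 * (a * xs - true_y) * xs)
    grad_adv = np.mean(-(1 - d_fake) * u * xs)    # d/da of -log D(x, a*x)
    a -= lr * (grad_mse + lam * grad_adv)

print(start, mse(a))   # the content loss drops as the generator learns
```

The same two-phase pattern (one discriminator update, one generator update, repeated) is what the joint alternating iteration in the text performs, just with convolutional networks and image tensors in place of scalars.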
The crowd counting model based on the generative adversarial network predicts crowd density through adversarial training. The min-max problem between the generative and discriminative models is optimized by joint, alternating iteration: the generator network is trained to produce accurate crowd density maps that fool the discriminator, and conversely the discriminator is trained to distinguish generated density maps from ground-truth density map labels. At the same time, the discriminator's output provides the generator with feedback on the localization and prediction accuracy of the density map. The two networks are trained in competition, improving the generated results until the discriminator can no longer correctly judge the generator's samples. By introducing the adversarial loss, the proposed crowd density estimation algorithm drives the convolutional neural network to generate higher-quality density maps, improving the accuracy of crowd counting.
Drawings
Fig. 1 is a structural diagram of the multi-scale generative adversarial network.
Detailed Description
The problem the invention addresses is: given a crowd image or a frame of video, estimate the crowd density in each region of the image and the total crowd count.
The structure of the multi-scale convolutional neural network is shown as the generator part of Fig. 1. In the first three convolutional blocks of the network, a multi-scale convolution module (Inception-style) extracts multi-scale features from the three blocks Conv-1, Conv-2, and Conv-3 respectively. Each module obtains features at different scales with three convolution kernels of different sizes, so every multi-scale module gives a multi-scale expression of its depth features. To make feature maps of different sizes consistent, pooling is used to unify their resolutions: Conv-1 is followed by two pooling layers and Conv-2 by one, so both match the size of Conv-3. Finally, the features of different levels and scales are fed into the Conv-4 convolutional layer and fused with a 1 x 1 convolution kernel. The network thus fuses three features of different scales and regresses the density map from the fused feature map. This design greatly improves the detection of small-scale pedestrians in high-density crowd scenes and ultimately improves the prediction of the crowd density map.
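The pooling-based size alignment and 1 x 1 fusion described above can be sketched with NumPy. The channel counts and the 64 x 64 resolution are invented for illustration; the patent does not give them:

```python
import numpy as np

def avg_pool2(x):
    """2x2 average pooling, halving spatial size (channels-last layout)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

# Hypothetical feature maps from three convolutional blocks:
conv1 = np.random.rand(64, 64, 8)    # shallow block, full resolution
conv2 = np.random.rand(32, 32, 16)   # after one downsampling
conv3 = np.random.rand(16, 16, 32)   # after two downsamplings

# Align spatial sizes: conv-1 is pooled twice, conv-2 once, to match conv-3.
f1 = avg_pool2(avg_pool2(conv1))
f2 = avg_pool2(conv2)
fused_in = np.concatenate([f1, f2, conv3], axis=-1)   # 16 x 16 x 56

# A 1x1 convolution is a per-pixel linear map over channels, which in
# channels-last layout is just a matrix product on the channel axis.
w_1x1 = np.random.rand(fused_in.shape[-1], 4)         # 4 fused output channels
fused = fused_in @ w_1x1
print(fused.shape)   # (16, 16, 4)
```

The key point the sketch makes concrete is that pooling, not cropping, brings the three scales to a common resolution before the channel-wise fusion.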
The multi-scale generative adversarial network model with the fused discriminator is shown in Fig. 1 and consists of two parts, a generator and a discriminator. The generator is the multi-scale convolutional neural network introduced above: it takes a crowd image as input and outputs a predicted crowd density map. The obtained density map is then stacked with the crowd image and input to the discriminator, which is trained to judge whether the input is a generated density map or a real one. Because the crowd image is stacked into the input, the discriminator must also judge whether the generated density map matches the crowd image.
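Stacking the crowd image and a density map into the discriminator's four-channel input takes only a couple of lines; the sizes below are illustrative:

```python
import numpy as np

# The discriminator input stacks the RGB crowd image with a single-channel
# density map (generated or ground truth) along the channel axis,
# giving an H x W x 4 tensor.
H, W = 64, 64
image = np.random.rand(H, W, 3)       # RGB crowd image
density = np.random.rand(H, W, 1)     # predicted or ground-truth map
disc_input = np.concatenate([image, density], axis=-1)
print(disc_input.shape)               # (64, 64, 4)
```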
In the experiments, training is carried out on an NVIDIA GeForce GTX TITAN X graphics card using the TensorFlow deep-learning framework. The whole network is trained with stochastic gradient descent (SGD), parameters are optimized with the Adam algorithm, and the momentum is set to 0.9. The parameters of the generator and the discriminator are initialized from a normal distribution. Because the training dataset used here is small, the batch size is set to 1. During training, the generator and discriminator are optimized in alternating iterations: the generator is first trained for 20 epochs on the mean squared loss alone; on that basis the discriminator is added, and the two networks are trained alternately for 100 epochs. The discriminator's input is a tensor whose structure consists of the three RGB channels of the original image and the single-channel density map, i.e. a four-channel tensor.
The invention is compared with other methods on the UCF_CC_50 dataset. Experimental results are evaluated with the Mean Absolute Error (MAE):

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| z_i - \hat{z}_i \right|$$

and the Mean Squared Error (MSE):

$$\mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( z_i - \hat{z}_i \right)^2}$$

where $N$ is the number of images, $z_i$ is the actual number of people in the $i$-th image, and $\hat{z}_i$ is the count output for the $i$-th image by the proposed network. On the UCF_CC_50 dataset the invention is compared with prior algorithms, as shown in the table below (MS-GAN is the algorithm of the invention):
The experimental comparison in the table shows that the method outperforms MCNN and CrowdNet in both accuracy and stability: MS-GAN performs best among the convolutional-neural-network-based crowd counting algorithms on both the MSE and MAE metrics, its counting error is relatively even across scenes, and it is comparatively stable. The quality of its predicted crowd density maps is clearly superior to that of other CNN-based crowd counting methods.
Claims (1)
1. A crowd counting method based on a multi-scale generative adversarial network, characterized by comprising the following specific steps:
1) Gaussian kernel density map of the crowd scene:
converting the given head-coordinate annotations into a crowd density distribution map: for a crowd image in the dataset with annotated head coordinates, each annotation is represented by a discrete delta function at the corresponding head position $x_i$, so the positions of the $N$ heads in each image are labeled as

$$H(x) = \sum_{i=1}^{N} \delta(x - x_i)$$

and, to convert this discrete head-position function into a continuous density function, it is convolved with a Gaussian kernel $G_\sigma$, giving the density equation

$$F(x) = H(x) * G_\sigma(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_\sigma(x);$$
2) constructing the multi-scale generative adversarial network:
the model structure comprises two main parts, a generator and a discriminator: the generator is a multi-scale convolutional neural network that takes a crowd image as input and outputs a predicted crowd density map; the obtained density map is stacked with the crowd image and input to the discriminator, which is trained to judge whether the input is a generated density map or a real one; because the crowd image is stacked into the input, the discriminator must also judge whether the generated density map matches the crowd image;
3) designing the content loss function:
adopting a pixel-level loss function, the Euclidean distance between the predicted density map and the real density map, i.e. the pixel-wise mean squared error (MSE):

$$L_E(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| G(X_i; \theta) - Y_i \right\|_2^2$$

where $G(X_i; \theta)$ is the density map generated by the generator, $\theta$ denotes the parameters of the generator network, $X_i$ is the $i$-th crowd image, $Y_i$ is its ground-truth label density map, and $N$ is the number of training images;
4) designing the adversarial loss function:
using an additional adversarial loss to improve the quality of the generated density map, the adversarial loss (Adversarial Loss) being expressed as:

$$L_A(\theta) = -\log D\!\left(X_i, G(X_i; \theta)\right)$$

where $D(X_i, G(X_i; \theta))$ expresses how well the predicted density map matches the corresponding crowd image; the discriminator's input is a tensor formed by stacking the crowd image $X_i$ with either the generated density map $G(X_i; \theta)$ or the ground-truth label $Y_i$ along the third (channel) dimension; finally, the loss function for the generator is a weighted sum of the mean squared error and the adversarial loss:

$$L(\theta) = L_E(\theta) + \lambda L_A(\theta)$$

based on extensive experiments, the weight $\lambda$ is set to balance the two loss values; combining the two loss functions makes the training of the network more stable and the density map prediction more accurate;
5) joint adversarial training:
adopting a conditional adversarial network model with the aim of estimating a high-quality crowd density map, the joint training objective of the generator and the discriminator being

$$\min_G \max_D \; \mathbb{E}_{X,Y}\!\left[\log D(X, Y)\right] + \mathbb{E}_{X}\!\left[\log\!\left(1 - D(X, G(X))\right)\right]$$

where $G$ denotes the generator network, which takes a crowd image $X$ as input and outputs a predicted crowd density map, and $D$ denotes the discriminator, whose output is the probability that the input density map is real; the purpose of the discriminator is to distinguish the density map $G(X)$ generated by the generator from the ground-truth label density map $Y$, while the generator is trained to produce a high-quality density map that the discriminator cannot discern.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811356818.1A CN111191667B (en) | 2018-11-15 | 2018-11-15 | Crowd counting method based on multiscale generation countermeasure network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811356818.1A CN111191667B (en) | 2018-11-15 | 2018-11-15 | Crowd counting method based on multiscale generation countermeasure network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111191667A true CN111191667A (en) | 2020-05-22 |
CN111191667B CN111191667B (en) | 2023-08-18 |
Family
ID=70707024
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811356818.1A Active CN111191667B (en) | 2018-11-15 | 2018-11-15 | Crowd counting method based on multiscale generation countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111191667B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111832413A (en) * | 2020-06-09 | 2020-10-27 | 天津大学 | People flow density map estimation, positioning and tracking method based on space-time multi-scale network |
CN111898903A (en) * | 2020-07-28 | 2020-11-06 | 北京科技大学 | Method and system for evaluating uniformity and comprehensive quality of steel product |
CN112818944A (en) * | 2021-03-08 | 2021-05-18 | 北方工业大学 | Dense crowd counting method for subway station scene |
CN112818945A (en) * | 2021-03-08 | 2021-05-18 | 北方工业大学 | Convolutional network construction method suitable for subway station crowd counting |
CN113313118A (en) * | 2021-06-25 | 2021-08-27 | 哈尔滨工程大学 | Self-adaptive variable-proportion target detection method based on multi-scale feature fusion |
CN113392779A (en) * | 2021-06-17 | 2021-09-14 | 中国工商银行股份有限公司 | Crowd monitoring method, device, equipment and medium based on generation of confrontation network |
CN114463694A (en) * | 2022-01-06 | 2022-05-10 | 中山大学 | Semi-supervised crowd counting method and device based on pseudo label |
CN114648724A (en) * | 2022-05-18 | 2022-06-21 | 成都航空职业技术学院 | Lightweight efficient target segmentation and counting method based on generation countermeasure network |
CN114972111A (en) * | 2022-06-16 | 2022-08-30 | 慧之安信息技术股份有限公司 | Dense crowd counting method based on GAN image restoration |
CN115983142A (en) * | 2023-03-21 | 2023-04-18 | 之江实验室 | Regional population evolution model construction method based on depth generation countermeasure network |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105740945A (en) * | 2016-02-04 | 2016-07-06 | 中山大学 | People counting method based on video analysis |
WO2016183766A1 (en) * | 2015-05-18 | 2016-11-24 | Xiaogang Wang | Method and apparatus for generating predictive models |
US20180075581A1 (en) * | 2016-09-15 | 2018-03-15 | Twitter, Inc. | Super resolution using a generative adversarial network |
CN107862261A (en) * | 2017-10-25 | 2018-03-30 | 天津大学 | Image people counting method based on multiple dimensioned convolutional neural networks |
CN108764085A (en) * | 2018-05-17 | 2018-11-06 | 上海交通大学 | Based on the people counting method for generating confrontation network |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016183766A1 (en) * | 2015-05-18 | 2016-11-24 | Xiaogang Wang | Method and apparatus for generating predictive models |
CN105740945A (en) * | 2016-02-04 | 2016-07-06 | 中山大学 | People counting method based on video analysis |
US20180075581A1 (en) * | 2016-09-15 | 2018-03-15 | Twitter, Inc. | Super resolution using a generative adversarial network |
CN107862261A (en) * | 2017-10-25 | 2018-03-30 | 天津大学 | Image people counting method based on multiple dimensioned convolutional neural networks |
CN108764085A (en) * | 2018-05-17 | 2018-11-06 | 上海交通大学 | Based on the people counting method for generating confrontation network |
Non-Patent Citations (1)
Title |
---|
吴淑窈; 刘希庚; 胡昌振; 王忠策: "Research and Implementation of Crowd Counting Based on Convolutional Neural Networks" (基于卷积神经网络人群计数的研究与实现), 科教导刊 (Science and Education Guide), no. 09 *
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111832413A (en) * | 2020-06-09 | 2020-10-27 | 天津大学 | People flow density map estimation, positioning and tracking method based on space-time multi-scale network |
CN111832413B (en) * | 2020-06-09 | 2021-04-02 | 天津大学 | People flow density map estimation, positioning and tracking method based on space-time multi-scale network |
CN111898903A (en) * | 2020-07-28 | 2020-11-06 | 北京科技大学 | Method and system for evaluating uniformity and comprehensive quality of steel product |
CN112818944A (en) * | 2021-03-08 | 2021-05-18 | 北方工业大学 | Dense crowd counting method for subway station scene |
CN112818945A (en) * | 2021-03-08 | 2021-05-18 | 北方工业大学 | Convolutional network construction method suitable for subway station crowd counting |
CN113392779A (en) * | 2021-06-17 | 2021-09-14 | 中国工商银行股份有限公司 | Crowd monitoring method, device, equipment and medium based on generation of confrontation network |
CN113313118A (en) * | 2021-06-25 | 2021-08-27 | 哈尔滨工程大学 | Self-adaptive variable-proportion target detection method based on multi-scale feature fusion |
CN114463694A (en) * | 2022-01-06 | 2022-05-10 | 中山大学 | Semi-supervised crowd counting method and device based on pseudo label |
CN114463694B (en) * | 2022-01-06 | 2024-04-05 | 中山大学 | Pseudo-label-based semi-supervised crowd counting method and device |
CN114648724A (en) * | 2022-05-18 | 2022-06-21 | 成都航空职业技术学院 | Lightweight efficient target segmentation and counting method based on generation countermeasure network |
CN114648724B (en) * | 2022-05-18 | 2022-08-12 | 成都航空职业技术学院 | Lightweight efficient target segmentation and counting method based on generation countermeasure network |
CN114972111A (en) * | 2022-06-16 | 2022-08-30 | 慧之安信息技术股份有限公司 | Dense crowd counting method based on GAN image restoration |
CN115983142A (en) * | 2023-03-21 | 2023-04-18 | 之江实验室 | Regional population evolution model construction method based on depth generation countermeasure network |
CN115983142B (en) * | 2023-03-21 | 2023-08-29 | 之江实验室 | Regional population evolution model construction method based on depth generation countermeasure network |
Also Published As
Publication number | Publication date |
---|---|
CN111191667B (en) | 2023-08-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111191667A (en) | Crowd counting method for generating confrontation network based on multiple scales | |
CN108830252B (en) | Convolutional neural network human body action recognition method fusing global space-time characteristics | |
CN106897670B (en) | Express violence sorting identification method based on computer vision | |
CN108416250B (en) | People counting method and device | |
CN108921051B (en) | Pedestrian attribute identification network and technology based on cyclic neural network attention model | |
CN105022982B (en) | Hand motion recognition method and apparatus | |
CN108764085B (en) | Crowd counting method based on generation of confrontation network | |
CN110378259A (en) | A kind of multiple target Activity recognition method and system towards monitor video | |
CN110059581A (en) | People counting method based on depth information of scene | |
CN111723693B (en) | Crowd counting method based on small sample learning | |
Zhou et al. | Adversarial learning for multiscale crowd counting under complex scenes | |
CN111709300B (en) | Crowd counting method based on video image | |
CN107506692A (en) | A kind of dense population based on deep learning counts and personnel's distribution estimation method | |
CN110298297A (en) | Flame identification method and device | |
CN106815563B (en) | Human body apparent structure-based crowd quantity prediction method | |
CN107863153A (en) | A kind of human health characteristic modeling measuring method and platform based on intelligent big data | |
CN110909672A (en) | Smoking action recognition method based on double-current convolutional neural network and SVM | |
Galčík et al. | Real-time depth map based people counting | |
CN106056078A (en) | Crowd density estimation method based on multi-feature regression ensemble learning | |
CN104123569B (en) | Video person number information statistics method based on supervised learning | |
Kim et al. | Estimation of crowd density in public areas based on neural network. | |
CN114943873B (en) | Method and device for classifying abnormal behaviors of staff on construction site | |
Shreedarshan et al. | Crowd recognition system based on optical flow along with SVM classifier | |
CN112818945A (en) | Convolutional network construction method suitable for subway station crowd counting | |
Ma et al. | Crowd estimation using multi-scale local texture analysis and confidence-based soft classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |