CN112633106A - Crowd characteristic recognition network construction and training method suitable for large depth of field - Google Patents

Crowd characteristic recognition network construction and training method suitable for large depth of field

Info

Publication number
CN112633106A
CN112633106A
Authority
CN
China
Prior art keywords
network
convolution
crowd
density
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011484694.2A
Other languages
Chinese (zh)
Inventor
田青
唐绍鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Jiuhe Intelligent Technology Co ltd
Original Assignee
Suzhou Jiuhe Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Jiuhe Intelligent Technology Co ltd filed Critical Suzhou Jiuhe Intelligent Technology Co ltd
Priority to CN202011484694.2A priority Critical patent/CN112633106A/en
Publication of CN112633106A publication Critical patent/CN112633106A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A crowd characteristic recognition network construction and training method suitable for large depth of field comprises the following: the front-end network is a VGG-16 network with the fully connected layers removed, using 3×3 convolution kernels; in the front-end VGG-16 network, max pooling is applied three times, which reduces the resolution of the feature map; the back-end network is a three-branch network using dilated convolutions with dilation rates of 2 and 3. The network training steps comprise: 1) generating density maps, 2) the loss function, 3) evaluation criteria. The scheme adds a prior-information loss: during network training, the L2 distance is used as the loss function. Because the traditional L2 distance alone overestimates the crowd in low-density areas and underestimates it in high-density areas, the loss function is computed blockwise, which greatly reduces the errors these problems cause and effectively improves counting accuracy.

Description

Crowd characteristic recognition network construction and training method suitable for large depth of field
Technical Field
The invention relates to the field of crowd counting in computer vision, and in particular to a method for constructing and training a convolutional-neural-network model that recognizes crowd characteristics under a large depth of field.
Background
The main task of crowd counting is to recognize crowd characteristics in an image and accurately calculate the number of people it contains. Early crowd-counting methods fell into two classes: detection-based and regression-based. In detection-based approaches, a sliding-window detector locates people in the scene and the detections are counted. These approaches divide into two broad categories: whole-body detection and part-based detection. Whole-body detection, the typical traditional approach, trains a classifier to detect pedestrians using features such as wavelets, HOG, and edges extracted from the whole body; the learning algorithms are mainly SVM, boosting, and random forests. Whole-body detection is mainly suited to sparse crowds, but as crowd density grows, occlusion between people becomes increasingly severe. Part-based detection was therefore introduced for the counting problem: it counts people by detecting partial body structures such as the head and shoulders, and performs slightly better than whole-body detection.
Regression-based methods learn a mapping from image features to a crowd count in two main steps: first, low-level features are extracted, such as foreground, edge, texture, and gradient features; second, a regression model, such as linear regression, piecewise linear regression, ridge regression, or Gaussian-process regression, learns the mapping from those low-level features to the crowd count.
Deep learning (DL) is now widely used across research fields such as computer vision and natural language processing, and researchers have applied it to crowd counting because of its excellent feature-learning ability. A deep neural network is designed to extract crowd features from the image at multiple stages; the feature maps are fused to generate a crowd density map, which is summed to obtain the number of people in the image, thereby achieving the counting goal.
Disclosure of Invention
The invention addresses the following technical problem: when counting people in an enclosed space, an excessive depth of field causes large variations in apparent person size, and a conventional network cannot adapt to and recognize crowd characteristics at such varied scales, which harms counting accuracy.
Specifically, the method for constructing and training a crowd characteristic recognition network suitable for large depth of field comprises:
Network front end: a VGG-16 network with the fully connected layers removed, using 3×3 convolution kernels;
Upsampling layer: the front-end VGG-16 network applies max pooling three times, which reduces the resolution of the feature map;
Network back end: a three-branch network using dilated convolutions with dilation rates of 2 and 3, the dilated convolution being defined as:
y(m, n) = \sum_{i=1}^{M} \sum_{j=1}^{N} x(m + r \cdot i,\, n + r \cdot j)\, w(i, j)
where y(m, n) is the output of the dilated convolution of the input image x(m, n) with a convolution kernel w(i, j) of length M and width N; the parameter r is the dilation rate; when r = 1, the dilated convolution reduces to an ordinary convolution;
the training of the crowd characteristic network comprises the following steps:
1) Generating a density map
The density map is defined by convolving an impulse function with a Gaussian kernel. If the position of the i-th annotated head is x_i, an image with N annotated heads is represented as:
H(x) = \sum_{i=1}^{N} \delta(x - x_i)
Convolving this with a Gaussian function turns it into a continuous function;
a density map with a geometry-adaptive Gaussian kernel is used, given by:
F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \quad \sigma_i = \beta \bar{d}_i
For each annotated head position x_i, the mean distance \bar{d}_i to its k nearest neighbours is computed, so the pixels associated with x_i correspond to a region on the ground in the scene whose radius is proportional to \bar{d}_i;
to estimate the crowd density around x_i, H(x) is convolved with an adaptive Gaussian kernel whose variance \sigma_i is variable and proportional to \bar{d}_i;
2) Loss function
During training, the learning rate of stochastic gradient descent is fixed at 1e-6; the Euclidean distance measures the difference between the generated density map and the ground truth; the loss function is defined as follows:
L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \lVert Z(X_i; \Theta) - Z_i^{GT} \rVert_2^2
where N is the batch size, X_i the i-th input picture, Z(X_i; \Theta) the generated density map, and Z_i^{GT} the ground-truth density map;
3) Evaluation criteria
The mean squared error MSE and mean absolute error MAE are used; MSE describes the accuracy of the model, with a smaller MSE indicating higher accuracy, while MAE reflects the error of the predicted values;
MAE = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|
MSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( C_i - C_i^{GT} \right)^2 }
where N is the number of pictures in the test sequence, C_i the predicted count for picture X_i, and C_i^{GT} the actual count; the predicted count is the sum over the density map:
C_i = \sum_{l=1}^{L} \sum_{w=1}^{W} z_{l,w}
where z_{l,w} is the pixel value at position (l, w) of the estimated density map of length L and width W.
The front-end VGG-16 network uses a combination of 10 convolutional layers and 3 pooling layers.
This scheme adds a prior-information loss: during network training, the L2 distance serves as the loss function; because the traditional L2 distance alone overestimates the crowd in low-density areas and underestimates it in high-density areas, the loss function is computed blockwise, which greatly reduces the resulting errors and effectively improves counting accuracy.
Drawings
FIG. 1 is a schematic diagram of dilated convolution;
fig. 2 is a schematic diagram of a network structure and a training process.
Detailed Description
The technical scheme is explained below with reference to the accompanying drawings:
Referring to fig. 2, a crowd counting model based on a multi-scale perception deep neural network includes:
1. Network front end:
A VGG-16 network with the fully connected layers removed is used, with 3×3 convolution kernels. Studies have shown that, for the same receptive field, models with smaller convolution kernels and more convolutional layers outperform those with larger kernels and fewer layers. To balance accuracy against resource overhead, the VGG-16 network here uses a combination of 10 convolutional layers and 3 pooling layers.
2. Upsampling layer
The front end's VGG-16 network applies max pooling three times, which lowers the resolution of the resulting feature map; an upsampling step is therefore used to restore the feature-map resolution.
3. Network back end
The back-end network is a three-branch network using dilated convolutions with dilation rates of 2 and 3, the dilated convolution being defined as:
y(m, n) = \sum_{i=1}^{M} \sum_{j=1}^{N} x(m + r \cdot i,\, n + r \cdot j)\, w(i, j)
where y(m, n) is the output of the dilated convolution of the input image x(m, n) with a convolution kernel w(i, j) of length M and width N, and the parameter r is the dilation rate. When r = 1, the dilated convolution reduces to an ordinary convolution. Experiments show that dilated convolution uses a sparse kernel in place of alternating convolution and pooling operations, enlarging the receptive field without increasing the number of network parameters or the computational cost, which makes it better suited to the crowd-density estimation task. An ordinary convolution, by contrast, must stack more convolutional layers to obtain a larger receptive field, adding considerably more computation. A dilated convolution with dilation rate r expands a K×K kernel to an effective size of K + (K-1)(r-1). The 3×3 receptive fields in fig. 1 are thus enlarged to 5×5 and 7×7, respectively.
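A direct, minimal NumPy implementation of the dilated-convolution formula makes the definition concrete. This is a single-channel, valid-region sketch with 0-based indices; the function name and toy inputs are illustrative, not from the patent:

```python
import numpy as np

def dilated_conv2d(x, w, r=1):
    """y(m, n) = sum_i sum_j x(m + r*i, n + r*j) * w(i, j).

    Single-channel "valid" dilated convolution (cross-correlation form)
    with dilation rate r; r = 1 reduces to an ordinary convolution.
    """
    K = w.shape[0]
    eff = K + (K - 1) * (r - 1)                    # effective kernel size
    out_h, out_w = x.shape[0] - eff + 1, x.shape[1] - eff + 1
    y = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            for i in range(K):
                for j in range(K):
                    y[m, n] += x[m + r * i, n + r * j] * w[i, j]
    return y

# A 3x3 kernel covers an effective K + (K-1)(r-1) window:
# 3x3 at r = 1, 5x5 at r = 2, 7x7 at r = 3 -- matching fig. 1.
x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3))
print(dilated_conv2d(x, w, r=2))  # one 5x5 window sampled on a sparse grid
```

Note that at r = 2 the 3×3 kernel reads 9 pixels spread over a 5×5 window, so the receptive field grows with no extra weights, which is the whole point made above.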
4. Training method
4.1) Generating a density map
The density-map generation follows the method of MCNN (CVPR 2016). The density map is defined by convolving an impulse function with a Gaussian kernel. Assuming the position of the i-th annotated head is x_i, an image with N annotated heads can be represented as
H(x) = \sum_{i=1}^{N} \delta(x - x_i)
Convolving this with a Gaussian function turns it into a continuous function. Such a density function, however, assumes each x_i is independent in image space. In fact each x_i is a sample of the crowd density in the 3D scene, and, because of perspective distortion, the pixels associated with different samples x_i correspond to regions of different scales in the scene. Thus, to estimate the crowd density accurately, the perspective transformation must be taken into account. If the crowd density around a head is assumed to be roughly uniform, the head's nearest neighbours give a reasonable estimate of the geometric distortion. So that the density map better matches images with different viewing angles (different head sizes) and dense crowds, a density map with a geometry-adaptive Gaussian kernel is used, given by:
F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \quad \sigma_i = \beta \bar{d}_i
For each head position x_i, the mean distance \bar{d}_i to its k nearest neighbours is computed; the pixels associated with x_i then correspond to a region on the ground in the scene whose radius is proportional to \bar{d}_i. Thus, to estimate the crowd density around pixel x_i, H(x) is convolved with an adaptive Gaussian kernel whose variance \sigma_i is variable and proportional to \bar{d}_i.
4.2) Loss function
During training, the learning rate of stochastic gradient descent is fixed at 1e-6. The Euclidean distance measures the difference between the generated density map and the ground truth. The loss function is defined as follows:
L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \lVert Z(X_i; \Theta) - Z_i^{GT} \rVert_2^2
where N is the batch size, X_i the i-th input picture, Z(X_i; \Theta) the generated density map, and Z_i^{GT} the ground-truth density map.
Using the L2 distance alone as the loss function overestimates the crowd in low-density areas and underestimates it in high-density areas; the loss function is therefore computed blockwise, with losses calculated separately for high-density regions (crowd-dense regions identified by comparison with the data) and low-density regions to reduce the error.
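The patent does not spell out how the blockwise loss partitions the map. The NumPy sketch below is one plausible reading, with the block size and the high/low-density threshold as stated assumptions:

```python
import numpy as np

def blockwise_l2_loss(pred, gt, block=16, thresh=None):
    """Split the density maps into blocks and accumulate the squared L2
    error separately for high-density and low-density blocks.

    block and thresh are illustrative assumptions: the threshold defaults
    to the mean ground-truth density, standing in for the patent's
    "crowd-dense regions identified by comparison with the data".
    """
    if thresh is None:
        thresh = gt.mean()
    hi, lo = 0.0, 0.0
    for r in range(0, gt.shape[0], block):
        for c in range(0, gt.shape[1], block):
            g = gt[r:r + block, c:c + block]
            p = pred[r:r + block, c:c + block]
            err = np.sum((p - g) ** 2)
            if g.mean() > thresh:
                hi += err
            else:
                lo += err
    return hi, lo   # combine downstream, e.g. weighted per region

gt = np.zeros((64, 64))
gt[:16, :16] = 0.05                 # one dense corner in the ground truth
pred = gt + 0.01                    # uniformly biased estimate
hi, lo = blockwise_l2_loss(pred, gt)
print(hi > 0 and lo > 0)            # both regions contribute error
```

Keeping the two partial sums separate is what allows the dense and sparse regions to be weighted differently, counteracting the over/underestimation bias of a single global L2 term.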
4.3) Evaluation criteria
To allow comparison with the latest research when evaluating the crowd-density estimation model, the mean squared error (MSE) and mean absolute error (MAE) commonly adopted by researchers are used. MSE describes the accuracy of the model, with a smaller MSE indicating higher accuracy, while MAE reflects the error of the predicted values.
MAE = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|
MSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( C_i - C_i^{GT} \right)^2 }
where N is the number of pictures in the test sequence, C_i the predicted count for picture X_i, and C_i^{GT} the actual count. The predicted count is obtained by summing the density map:
C_i = \sum_{l=1}^{L} \sum_{w=1}^{W} z_{l,w}
where z_{l,w} is the pixel value at position (l, w) of the estimated density map of length L and width W.
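Summing the density map to obtain C_i, then computing MAE and MSE over a test sequence, can be written directly; the toy maps and counts below are illustrative:

```python
import numpy as np

def evaluate(pred_maps, gt_counts):
    """C_i = sum over all pixels z_{l,w}; then MAE and MSE as defined.

    Note that the patent's "MSE" takes a square root, i.e. it is an
    RMSE-style quantity in the same units as the count.
    """
    pred_counts = np.array([z.sum() for z in pred_maps])
    gt = np.asarray(gt_counts, dtype=float)
    mae = np.mean(np.abs(pred_counts - gt))
    mse = np.sqrt(np.mean((pred_counts - gt) ** 2))
    return pred_counts, mae, mse

# Two toy density maps: one sums to 4 people, one to 9.
maps = [np.full((2, 2), 1.0), np.full((3, 3), 1.0)]
counts, mae, mse = evaluate(maps, gt_counts=[5, 9])
print(counts, mae, mse)   # counts [4. 9.], MAE 0.5, MSE sqrt(0.5)
```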
In some scenarios, the crowd density varies over time. The scheme computes the loss function blockwise, calculating losses separately for high-density regions (crowd-dense regions identified by comparison with the data) and low-density regions to reduce the error.
Take people counting in an enclosed space as an example: because images captured by a camera in an enclosed space have a large depth of field, head sizes vary widely within a single image, so a multi-scale neural network is better suited to the recognition task.
This scheme adds a prior-information loss: during network training, the L2 distance serves as the loss function; because the traditional L2 distance alone overestimates the crowd in low-density areas and underestimates it in high-density areas, the loss function is computed blockwise, which greatly reduces the resulting errors and effectively improves counting accuracy.

Claims (2)

1. A method for constructing and training a crowd characteristic recognition network suitable for large depth of field, characterized in that
the crowd characteristic recognition network comprises:
the network front end: a VGG-16 network with the fully connected layers removed, using 3×3 convolution kernels;
an upsampling layer: the front-end VGG-16 network applies max pooling three times, which reduces the resolution of the feature map;
the network back end: a three-branch network using dilated convolutions with dilation rates of 2 and 3, the dilated convolution being defined as:
y(m, n) = \sum_{i=1}^{M} \sum_{j=1}^{N} x(m + r \cdot i,\, n + r \cdot j)\, w(i, j)
wherein: y(m, n) is the output of the dilated convolution of the input image x(m, n) with a convolution kernel w(i, j) of length M and width N; the parameter r is the dilation rate; when r = 1, the dilated convolution reduces to an ordinary convolution;
the training of the crowd characteristic network comprises the following steps:
1) generating a density map:
the density map is defined by convolving an impulse function with a Gaussian kernel;
assuming the position of the i-th annotated head is x_i, an image with N annotated heads is denoted H(x); if the crowd density around a head is assumed to be roughly uniform, the head's nearest neighbours give a reasonable estimate of the geometric distortion;
so that the density map better matches images with different viewing angles and dense crowds, a density map with a geometry-adaptive Gaussian kernel is used; for each head position x_i, the mean distance \bar{d}_i over several nearest neighbours is computed, and the pixels associated with x_i correspond to a region on the ground in the scene whose radius is proportional to \bar{d}_i; to estimate the crowd density around x_i, H(x) is convolved with an adaptive Gaussian kernel whose variance \sigma_i is variable and proportional to \bar{d}_i;
2) loss function
during training, the learning rate of stochastic gradient descent is fixed at 1e-6;
the Euclidean distance measures the difference between the generated density map and the ground truth; the loss function is computed blockwise, with losses calculated separately for the high-density and low-density regions to reduce the error;
3) evaluation criteria
when evaluating the crowd density estimation model, the mean squared error MSE and the mean absolute error MAE are used; MSE describes the accuracy of the model, with a smaller MSE indicating higher accuracy, while MAE reflects the error of the predicted values.
2. The method for constructing and training a crowd characteristic recognition network according to claim 1, characterized in that the front-end VGG-16 network uses a combination of 10 convolutional layers and 3 pooling layers.
CN202011484694.2A 2020-12-16 2020-12-16 Crowd characteristic recognition network construction and training method suitable for large depth of field Pending CN112633106A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011484694.2A CN112633106A (en) 2020-12-16 2020-12-16 Crowd characteristic recognition network construction and training method suitable for large depth of field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011484694.2A CN112633106A (en) 2020-12-16 2020-12-16 Crowd characteristic recognition network construction and training method suitable for large depth of field

Publications (1)

Publication Number Publication Date
CN112633106A true CN112633106A (en) 2021-04-09

Family

ID=75313421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011484694.2A Pending CN112633106A (en) 2020-12-16 2020-12-16 Crowd characteristic recognition network construction and training method suitable for large depth of field

Country Status (1)

Country Link
CN (1) CN112633106A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210603A (en) * 2019-06-10 2019-09-06 长沙理工大学 Counter model construction method, method of counting and the device of crowd
CN111563447A (en) * 2020-04-30 2020-08-21 南京邮电大学 Crowd density analysis and detection positioning method based on density map
CN111611878A (en) * 2020-04-30 2020-09-01 杭州电子科技大学 Method for crowd counting and future people flow prediction based on video image

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210603A (en) * 2019-06-10 2019-09-06 长沙理工大学 Counter model construction method, method of counting and the device of crowd
CN111563447A (en) * 2020-04-30 2020-08-21 南京邮电大学 Crowd density analysis and detection positioning method based on density map
CN111611878A (en) * 2020-04-30 2020-09-01 杭州电子科技大学 Method for crowd counting and future people flow prediction based on video image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI YUHONG 等: "CSRNet: dilated convolutional neural networks for understanding the highly congested scenes", 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 23 June 2018 (2018-06-23), pages 3 *
严芳芳 et al.: "Crowd counting algorithm using a multi-channel fusion grouped convolutional neural network" (in Chinese), Journal of Chinese Computer Systems (小型微型计算机系统), no. 10, 15 October 2020 (2020-10-15) *

Similar Documents

Publication Publication Date Title
CN108615027B (en) Method for counting video crowd based on long-term and short-term memory-weighted neural network
CN111563447B (en) Crowd density analysis and detection positioning method based on density map
TWI794414B (en) Systems and methods for real-time object detection using depth sensors
US10839543B2 (en) Systems and methods for depth estimation using convolutional spatial propagation networks
US11200424B2 (en) Space-time memory network for locating target object in video content
CN106096561B (en) Infrared pedestrian detection method based on image block deep learning features
CN107657226B (en) People number estimation method based on deep learning
Jeon et al. Road detection in spaceborne SAR images using a genetic algorithm
CN110942471B (en) Long-term target tracking method based on space-time constraint
US20040213460A1 (en) Method of human figure contour outlining in images
CN111340881B (en) Direct method visual positioning method based on semantic segmentation in dynamic scene
CN110765833A (en) Crowd density estimation method based on deep learning
CN107767416B (en) Method for identifying pedestrian orientation in low-resolution image
CN111783589B (en) Complex scene crowd counting method based on scene classification and multi-scale feature fusion
CN111191667A (en) Crowd counting method for generating confrontation network based on multiple scales
CN106157330B (en) Visual tracking method based on target joint appearance model
CN110879982A (en) Crowd counting system and method
CN112991269A (en) Identification and classification method for lung CT image
CN113822352B (en) Infrared dim target detection method based on multi-feature fusion
CN111860823B (en) Neural network training method, neural network image processing method, neural network training device, neural network image processing equipment and storage medium
CN112101195A (en) Crowd density estimation method and device, computer equipment and storage medium
CN116740439A (en) Crowd counting method based on trans-scale pyramid convertors
CN113408398A (en) Remote sensing image cloud detection method based on channel attention and probability up-sampling
CN106056078A (en) Crowd density estimation method based on multi-feature regression ensemble learning
CN110930384A (en) Crowd counting method, device, equipment and medium based on density information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination