CN113283356A - Multi-level attention scale perception crowd counting method - Google Patents


Info

Publication number
CN113283356A
Authority
CN
China
Prior art keywords
convolution
module
kernel
attention
network
Prior art date
Legal status
Granted
Application number
CN202110605990.1A
Other languages
Chinese (zh)
Other versions
CN113283356B (en)
Inventor
祝鲁宁 (Zhu Luning)
黄良军 (Huang Liangjun)
沈世晖 (Shen Shihui)
张亚妮 (Zhang Yani)
Current Assignee
Shanghai Institute of Technology
Original Assignee
Shanghai Institute of Technology
Priority date
Filing date
Publication date
Application filed by Shanghai Institute of Technology
Priority to CN202110605990.1A
Publication of CN113283356A
Application granted
Publication of CN113283356B
Active legal status
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/50 - Context or environment of the image
    • G06V 20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53 - Recognition of crowd images, e.g. recognition of crowd congestion


Abstract

The invention provides a multi-level attention scale perception crowd counting method, an application of deep learning in computer vision. The method comprises the following steps: S1: acquiring a data set; S2: constructing a multi-level attention scale perception neural network; S3: debugging, training and testing the multi-level attention scale perception neural network; S4: acquiring a camera image, feeding it to the trained neural network, and obtaining the predicted density map and predicted head count for the image. In this way, the method is applicable to counting people in large-scale scenes and effectively improves the accuracy of the detection result.

Description

Multi-level attention scale perception crowd counting method
Technical Field
The invention relates to a multi-level attention scale perception crowd counting method.
Background
With the accelerating pace of urbanization and the rapid development of urban economies, mass-gathering scenes and tourist numbers are increasing, and with them come safety hazards. A crowd counting method that predicts crowd size and raises early warnings for highly crowded scenes can therefore help the relevant personnel issue warnings and make decisions before and after an emergency, protect people's lives and property, and prevent dangerous events.
Existing crowd counting methods fall mainly into two categories: 1) traditional methods, such as support vector machines and decision trees; 2) deep-learning methods, such as convolutional neural networks like MCNN and CSRNet. Both have limitations: the traditional methods of category 1) are complex and imprecise, and the existing neural networks of category 2) still suffer from low accuracy.
Disclosure of Invention
The invention aims to provide a multi-level attention scale perception crowd counting method.
In order to solve the above problems, the present invention provides a multi-level attention scale perception crowd counting method, comprising:
s1: acquiring a data set and preprocessing the data set to obtain a density map of a training set and a density map of a testing set;
s2: constructing a backbone of a multi-level attention scale perception neural network;
s3: debugging and training the multi-level attention scale perception neural network and testing the effectiveness of the network based on the training set, the testing set and the backbone of the multi-level attention scale perception neural network to obtain the trained neural network;
s4: and acquiring a camera image, inputting the trained neural network for testing, and acquiring a predicted density map and the predicted number of people of the image.
Further, the step S1 includes:
s11: downloading a public data set, and dividing the public data set into a training set and a testing set;
s12: performing data enhancement on the training set and the test set by flipping each image left and right, doubling the data volume, to obtain the image data of the training set and of the test set respectively;
s13: padding the width and height of the training-set and test-set images to multiples of 16 and adjusting the head-location map coordinates proportionally, to obtain the location map of the training-set labels and the location map of the test-set labels;
s14: processing the location map of the training-set labels into the density map of the training set using a Gaussian kernel function with a Gaussian kernel size of 15, and likewise processing the location map of the test-set labels into the density map of the test set.
Further, the step S2 includes:
s21: designing the feature-extraction encoder structure: taking the first ten convolution layers of VGG16 as feature extraction layers, each a Conv2d convolution with kernel=3 followed by a ReLU activation function, with channel counts 64, 64, 128, 128, maxpooling(kernel=2), 256, 256, 256, maxpooling(kernel=2), 512, 512, 512; extracting depth features with this encoder structure and loading the VGG16 pre-training parameters;
s22: designing the decoder that regresses the crowd density map and the head count.
Further, the step S22 includes:
s221: the back-end backbone network is SA-UGA-SA-UPA-SA-USA: a scale-aware module SA with kernel=3, 512 input channels and 128 output channels, followed by a ReLU activation function; a custom channel attention upsampling module UGA; a two-dimensional convolution with kernel=3, 128 input channels and 64 output channels, followed by a ReLU activation function; a custom spatial attention upsampling module USA; a two-dimensional convolution with kernel=3, 64 input channels and 16 output channels, followed by a ReLU activation function; a custom pixel attention upsampling module UPA; the last layer is a full convolution with 16 input channels, 1 output channel and kernel=1, followed by a ReLU activation function. A residual structure is added between the SA modules, and the network finally outputs the predicted density map;
s222: constructing a pixel attention module: apply to the input feature map in a two-dimensional convolution whose input channels equal its output channels, with kernel=1, then a sigmoid function, obtaining out; the final output is the element-wise product of in and out, plus in;
s223: constructing the custom multi-level attention modules: the channel attention upsampling module (UGA) first upsamples by a factor of two, takes the sum of adaptive average pooling and adaptive max pooling, passes it through a Sigmoid function, and combines the result with the twice-upsampled feature map to produce the module's output feature map; the spatial attention upsampling module (USA) first upsamples by a factor of two, concatenates the channel-wise mean and maximum of the feature map along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3 followed by a Sigmoid function, and combines the result with the twice-upsampled feature map to produce the module's output feature map; the pixel attention upsampling module (UPA) first upsamples by a factor of two, applies a two-dimensional convolution with kernel 3 and a two-dimensional convolution with kernel 1, passes the kernel-1 convolution output through a Sigmoid function to obtain a weight, and applies kernel-1 and kernel-3 convolutions to the weighted features to produce the module's output feature map.
S224: constructing the custom scale-aware module SA. The input is x, whose channel count is recorded as c. x is fed into four parallel branches that extract features in different ways: the first branch uses a convolution kernel of size 3 with a dilation rate of 1, giving f1; the second, third and fourth branches use different dilation rates, giving f2, f3 and f4. f1 is combined with f2 of the next branch, a two-dimensional convolution reduces the channels to c, and a pixel attention module yields y1; y1 is combined with f3, a two-dimensional convolution reduces the channels to c, and the pixel attention module yields y2; y2 is combined with f4, a two-dimensional convolution reduces the channels to c, and the pixel attention module yields y3. Finally, y1, y2 and y3 are concatenated along the channel dimension and a two-dimensional convolution reduces the channels from 3c to c, giving the output y.
Further, the step S3 includes:
s31: setting the loss function and parameters: the loss function is the MSE (mean squared error); the Adam optimizer is used, the batch size is set to 1, the learning rate to 0.00001, and the number of epochs to 1000;
s32: inputting the images and their Gaussian-processed density maps into the neural network for training;
s33: loading the trained network parameters, computing the evaluation metrics MAE and MSE on the test set, and estimating the network performance.
Further, the step S4 includes:
s41: processing the picture acquired by the camera to at most 768 × 1024 pixels, obtaining a processed picture;
s42: inputting the processed picture into the trained neural network to obtain the corresponding predicted density map and predicted head count y.
Compared with the prior art, the method proceeds through S1: acquiring a data set; S2: constructing a multi-level attention scale perception neural network; S3: debugging, training and testing the network; S4: acquiring a camera image, feeding it to the trained neural network, and obtaining the predicted density map and predicted head count of the image. In this way, the method is applicable to counting people in large-scale scenes and effectively improves the accuracy of the detection result.
Compared with the prior art, the invention has the beneficial effects that:
1: The invention can perform more accurate crowd density and count estimation for large-scale crowds;
2: The structure of a classical convolutional neural network is improved: simple convolutional layers are replaced with multi-level attention modules and a custom scale-aware module, and an Adam optimizer optimizes the network's initial weights and thresholds, accelerating convergence, bringing the network close to its optimal parameters, and strengthening the network's extraction of different features;
3: On top of the features extracted by the first ten layers of VGG16, the custom scale-aware module further extracts feature information at different spatial scales, raising the network's attention to dense crowds and overcoming the incompleteness of single-scale feature extraction. Multi-level attention increases the weights of effective features at different scales, weakens the background weight, and improves the regression performance.
Drawings
FIG. 1 is a schematic flow chart structure diagram of a crowd counting method based on multi-level attention scale perception according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a multi-level attention-scale aware neural network architecture according to one embodiment of the present invention;
FIG. 3 is a schematic diagram of a channel attention upsampling module (UGA) of one embodiment of the present invention;
FIG. 4 is a schematic diagram of a spatial attention upsampling module (USA) of one embodiment of the present invention;
FIG. 5 is a schematic diagram of a pixel attention upsampling module (UPA) according to one embodiment of the present invention;
FIG. 6 is a schematic diagram of a scale-aware module architecture according to one embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the present invention provides a multi-level attention scale perception population counting method, comprising:
a crowd counting method based on multi-level attention scale perception is characterized by comprising the following steps:
s1: acquiring a data set and preprocessing the data set to obtain a density map of a training set and a density map of a testing set;
s2: constructing a backbone of a multi-level attention scale perception neural network;
s3: debugging and training the multi-level attention scale perception neural network and testing the effectiveness of the network based on the training set, the testing set and the backbone of the multi-level attention scale perception neural network to obtain the trained neural network;
s4: and acquiring a camera image, inputting the trained neural network for testing, and acquiring a predicted density map and the predicted number of people of the image.
The invention adopts a multi-scale-aware neural network that can effectively extract features of crowds of different densities; meanwhile, attention at different scales concentrates the network on the densely crowded regions of a single picture, overcoming the limited richness of features extracted at a single scale and enhancing the ability of multi-level feature maps to learn suitable feature representations.
Further, the step S1 includes:
s11: downloading a public data set, and dividing the public data set into a training set and a testing set;
s12: performing data enhancement on the training set and the test set by flipping each image left and right, doubling the data volume, to obtain the image data of the training set and of the test set respectively;
s13: padding the width and height of the training-set and test-set images to multiples of 16 and adjusting the head-location map coordinates proportionally, to obtain the location map of the training-set labels and the location map of the test-set labels;
s14: processing the location map of the training-set labels into the density map of the training set using a Gaussian kernel function with a Gaussian kernel size of 15, and likewise processing the location map of the test-set labels into the density map of the test set.
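The density-map generation of s14 can be sketched as follows. This is a minimal illustration, not the patented implementation: the patent fixes the Gaussian kernel size at 15 but does not state the standard deviation, so the sigma below (and the `density_map` helper name) are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(points, height, width, sigma=4.0):
    """Convert head annotations (x, y) into a density map by smoothing a
    dot map with a fixed Gaussian. With sigma=4 and truncate=1.75 the
    filter radius is 7, i.e. an effective 15x15 kernel as in the patent."""
    dots = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        if 0 <= int(y) < height and 0 <= int(x) < width:
            dots[int(y), int(x)] += 1.0  # one unit of mass per head
    # Smoothing preserves the total mass, so the map still sums to the count.
    return gaussian_filter(dots, sigma=sigma, truncate=1.75)
```

Because the smoothing preserves total mass, integrating the density map recovers the annotated head count, which is what the network is trained to regress.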
Further, the step S2 includes:
s21: designing the feature-extraction encoder structure: taking the first ten convolution layers of VGG16 as feature extraction layers, each a Conv2d convolution with kernel=3 followed by a ReLU activation function, with channel counts 64, 64, 128, 128, maxpooling(kernel=2), 256, 256, 256, maxpooling(kernel=2), 512, 512, 512; extracting depth features with this encoder structure and loading the VGG16 pre-training parameters;
s22: designing the decoder that regresses the crowd density map and the head count.
Further, the step S22 includes:
s221: the back-end backbone network is SA-UGA-SA-UPA-SA-USA: a scale-aware module SA with kernel=3, 512 input channels and 128 output channels, followed by a ReLU activation function; a custom channel attention upsampling module UGA; a two-dimensional convolution with kernel=3, 128 input channels and 64 output channels, followed by a ReLU activation function; a custom spatial attention upsampling module USA; a two-dimensional convolution with kernel=3, 64 input channels and 16 output channels, followed by a ReLU activation function; a custom pixel attention upsampling module UPA; the last layer is a full convolution with 16 input channels, 1 output channel and kernel=1, followed by a ReLU activation function. A residual structure is added between the SA modules, and the network finally outputs the predicted density map;
s222: constructing a pixel attention module: apply to the input feature map in a two-dimensional convolution whose input channels equal its output channels, with kernel=1, then a sigmoid function, obtaining out; the final output is the element-wise product of in and out, plus in;
s223: constructing the custom multi-level attention modules: the channel attention upsampling module (UGA) first upsamples by a factor of two, takes the sum of adaptive average pooling and adaptive max pooling, passes it through a Sigmoid function, and combines the result with the twice-upsampled feature map to produce the module's output feature map; the spatial attention upsampling module (USA) first upsamples by a factor of two, concatenates the channel-wise mean and maximum of the feature map along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3 followed by a Sigmoid function, and combines the result with the twice-upsampled feature map to produce the module's output feature map; the pixel attention upsampling module (UPA) first upsamples by a factor of two, applies a two-dimensional convolution with kernel 3 and a two-dimensional convolution with kernel 1, passes the kernel-1 convolution output through a Sigmoid function to obtain a weight, and applies kernel-1 and kernel-3 convolutions to the weighted features to produce the module's output feature map.
S224: constructing the custom scale-aware module SA. The input is x, whose channel count is recorded as c. x is fed into four parallel branches that extract features in different ways: the first branch uses a convolution kernel of size 3 with a dilation rate of 1, giving f1; the second, third and fourth branches use different dilation rates, giving f2, f3 and f4. f1 is combined with f2 of the next branch, a two-dimensional convolution reduces the channels to c, and a pixel attention module yields y1; y1 is combined with f3, a two-dimensional convolution reduces the channels to c, and the pixel attention module yields y2; y2 is combined with f4, a two-dimensional convolution reduces the channels to c, and the pixel attention module yields y3. Finally, y1, y2 and y3 are concatenated along the channel dimension and a two-dimensional convolution reduces the channels from 3c to c, giving the output y.
Further, the step S3 includes:
s31: setting the loss function and parameters: the loss function is the MSE (mean squared error); the Adam optimizer is used, the batch size is set to 1, the learning rate to 0.00001, and the number of epochs to 1000;
s32: inputting the images and their Gaussian-processed density maps into the neural network for training;
s33: loading the trained network parameters, computing the evaluation metrics MAE and MSE on the test set, and estimating the network performance.
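Steps s31 to s33 can be sketched as below. This is a hedged illustration: `net` is only a placeholder for the full multi-level attention scale perception network, the random tensors stand in for a real image and its density-map label, and reporting MSE as the root mean squared error follows the usual crowd-counting convention (an assumption; the patent only names "mae and mse").

```python
import torch
import torch.nn as nn

# s31 setup: MSE loss on density maps, Adam optimizer, lr 1e-5, batch size 1.
net = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # placeholder model
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-5)

# s32 training step on one image (batch size 1); the patent runs 1000 epochs.
image = torch.randn(1, 3, 64, 64)    # stand-in training image
target = torch.rand(1, 1, 64, 64)    # stand-in Gaussian density-map label
for _ in range(2):
    optimizer.zero_grad()
    loss = criterion(net(image), target)
    loss.backward()
    optimizer.step()

# s33 evaluation over per-image predicted vs. ground-truth head counts.
def mae_mse(pred_counts, gt_counts):
    """Mean absolute error and root mean squared error of the counts."""
    pred = torch.as_tensor(pred_counts, dtype=torch.float64)
    gt = torch.as_tensor(gt_counts, dtype=torch.float64)
    return (pred - gt).abs().mean().item(), ((pred - gt) ** 2).mean().sqrt().item()
```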
Further, the step S4 includes:
s41: processing the picture acquired by the camera to at most 768 × 1024 pixels, obtaining a processed picture;
s42: inputting the processed picture into the trained neural network to obtain the corresponding predicted density map and predicted head count y.
Specifically, as shown in fig. 1, the present invention provides a crowd counting method based on multi-level attention scale perception, including:
s1: acquiring a data set and preprocessing the data set;
s2: constructing a multi-level attention scale perception neural network backbone;
s3: debugging and training a multi-level attention scale perception neural network and testing the effectiveness of the network;
s4: and acquiring a camera image, inputting the trained neural network for testing, and acquiring a predicted density map and the predicted number of people of the image.
As shown in fig. 2, the present invention provides a crowd counting method based on multi-level attention scale perception, further elaborating the details of the multi-level attention scale perception neural network structure, including:
1: The front-end network extracts features. The first ten convolution layers of VGG16 are used as feature extraction layers, each a Conv2d convolution with kernel=3 followed by a ReLU activation function, with channel counts 64, 64, 128, 128, maxpooling(kernel=2), 256, 256, 256, maxpooling(kernel=2), 512, 512, 512. Depth features are extracted with this structure.
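A sketch of this front-end in PyTorch, under the assumption that the layer list above is read as channel counts with pooling where indicated; copying in the VGG16 pre-trained weights is omitted here.

```python
import torch
import torch.nn as nn

def make_frontend():
    """Build the ten VGG16 convolution layers listed above, with ReLU after
    every convolution and 2x2 max-pooling where the text places it."""
    cfg = [64, 64, 128, 128, "M", 256, 256, 256, "M", 512, 512, 512]
    layers, in_ch = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)
```

With two pooling stages, the encoder downsamples the input by a factor of 4 and outputs 512 channels, matching the 512-channel input of the decoder's first SA module.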
2: back-end network design.
3: Multi-level attention scale-aware neural network.
As shown in fig. 3 to 6, the present invention provides a crowd counting method based on multi-level attention scale perception, further elaborating the multi-level attention scale perception module therein, including:
1: A pixel attention module is constructed. A two-dimensional convolution whose input channels equal its output channels, with kernel=1, is applied to the input feature map in, followed by a sigmoid function, obtaining out; the final output is the element-wise product of in and out, plus in. In this way a weight parameter is attached to every pixel, improving precision.
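A minimal PyTorch sketch of this pixel attention module (the class name `PixelAttention` is ours):

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """1x1 conv -> sigmoid gives a per-pixel weight map; the residual add
    keeps the original signal: output = in * out + in."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        w = torch.sigmoid(self.conv(x))
        return x * w + x
```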
2: The custom multi-level attention modules are constructed. The channel attention upsampling module (UGA) first upsamples by a factor of two, takes the sum of adaptive average pooling and adaptive max pooling, passes it through a Sigmoid function, and combines the result with the twice-upsampled feature map to produce the module's output feature map. The spatial attention upsampling module (USA) first upsamples by a factor of two, concatenates the channel-wise mean and maximum of the feature map along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3 followed by a Sigmoid function, and combines the result with the twice-upsampled feature map to produce the module's output feature map. The pixel attention upsampling module (UPA) first upsamples by a factor of two, applies a two-dimensional convolution with kernel 3 and a two-dimensional convolution with kernel 1, passes the kernel-1 convolution output through a Sigmoid function to obtain a weight, and applies kernel-1 and kernel-3 convolutions to the weighted features to produce the module's output feature map.
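The three upsampling attention modules can be sketched as follows. The text is ambiguous about whether the attention map is added to or multiplied with the upsampled features; element-wise multiplication, the common attention convention, is assumed here, as is nearest-neighbour upsampling, and the UPA wiring is a simplified reading of the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UGA(nn.Module):
    """Channel attention upsampling: 2x upsample, then channel weights from
    the sigmoid of (adaptive avg pool + adaptive max pool)."""
    def forward(self, x):
        up = F.interpolate(x, scale_factor=2, mode="nearest")
        w = torch.sigmoid(F.adaptive_avg_pool2d(up, 1) + F.adaptive_max_pool2d(up, 1))
        return up * w

class USA(nn.Module):
    """Spatial attention upsampling: 2x upsample, concat channel-wise mean
    and max, conv with kernel 7 and dilation 3 (padding 9 keeps the size)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, dilation=3, padding=9)
    def forward(self, x):
        up = F.interpolate(x, scale_factor=2, mode="nearest")
        stat = torch.cat([up.mean(dim=1, keepdim=True),
                          up.max(dim=1, keepdim=True).values], dim=1)
        return up * torch.sigmoid(self.conv(stat))

class UPA(nn.Module):
    """Pixel attention upsampling: 2x upsample, kernel-3 then kernel-1
    convolutions, sigmoid weight, reweighted features as output."""
    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.weight = nn.Conv2d(channels, channels, kernel_size=1)
    def forward(self, x):
        up = F.interpolate(x, scale_factor=2, mode="nearest")
        feat = self.conv1(self.conv3(up))
        return feat * torch.sigmoid(self.weight(feat))
```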
3: The custom scale-aware module SA is constructed. The input is x, whose channel count is recorded as c. x is fed into four parallel branches that extract features in different ways: the first branch uses a convolution kernel of size 3 with a dilation rate of 1, giving f1; the second, third and fourth branches use different dilation rates, giving f2, f3 and f4. f1 is combined with f2 of the next branch, a two-dimensional convolution reduces the channels to c, and a pixel attention module yields y1; y1 is combined with f3, a two-dimensional convolution reduces the channels to c, and the pixel attention module yields y2; y2 is combined with f4, a two-dimensional convolution reduces the channels to c, and the pixel attention module yields y3. Finally, y1, y2 and y3 are concatenated along the channel dimension and a two-dimensional convolution reduces the channels from 3c to c, giving the output y.
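A sketch of the SA module under stated assumptions: the patent fixes only the first branch's dilation rate at 1 and says the other branches use different dilated convolutions, so the rates 2, 3 and 4 below are illustrative, and the small `PA` class restates the pixel attention module for self-containment.

```python
import torch
import torch.nn as nn

class PA(nn.Module):
    """Pixel attention: 1x1 conv -> sigmoid -> reweight, plus residual."""
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Conv2d(c, c, kernel_size=1)
    def forward(self, x):
        return x * torch.sigmoid(self.conv(x)) + x

class SA(nn.Module):
    """Scale-aware module sketch: four parallel dilated-conv branches fused
    pairwise through convolutions and pixel attention, then concatenated."""
    def __init__(self, c_in, c):
        super().__init__()
        # dilation rates (1, 2, 3, 4) are an assumption; padding=d keeps size
        self.branches = nn.ModuleList(
            nn.Conv2d(c_in, c, 3, padding=d, dilation=d) for d in (1, 2, 3, 4))
        self.fuse12 = nn.Conv2d(2 * c, c, 3, padding=1)
        self.fuse23 = nn.Conv2d(2 * c, c, 3, padding=1)
        self.fuse34 = nn.Conv2d(2 * c, c, 3, padding=1)
        self.pa = PA(c)
        self.out = nn.Conv2d(3 * c, c, 3, padding=1)  # 3c -> c
    def forward(self, x):
        f1, f2, f3, f4 = (b(x) for b in self.branches)
        y1 = self.pa(self.fuse12(torch.cat([f1, f2], dim=1)))
        y2 = self.pa(self.fuse23(torch.cat([y1, f3], dim=1)))
        y3 = self.pa(self.fuse34(torch.cat([y2, f4], dim=1)))
        return self.out(torch.cat([y1, y2, y3], dim=1))
```

The progressive fusion lets features from small receptive fields inform the larger-dilation branches, which is what gives the module its multi-scale character.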
As shown in fig. 2, the present invention provides a people counting method based on multi-level attention scale perception, and its application part is further described. Image data is acquired with a camera and processed to at most 768 × 1024 pixels; if it is a grayscale image it is converted to a three-channel RGB image; the trained network and its parameters are loaded, and the image is input to obtain the predicted head count.
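This application step can be sketched as below; `model` is assumed to be the trained network returning a single-channel density map, and the padding to multiples of 16 mirrors the preprocessing of s13.

```python
import torch
import torch.nn.functional as F

def predict_count(model, image):
    """Pad an RGB tensor (1, 3, H, W) so H and W are multiples of 16, run
    the trained network, and integrate the predicted density map to get
    the head count."""
    _, _, h, w = image.shape
    pad_h = (16 - h % 16) % 16
    pad_w = (16 - w % 16) % 16
    image = F.pad(image, (0, pad_w, 0, pad_h))  # pad right and bottom edges
    with torch.no_grad():
        density = model(image)
    return density, float(density.sum())
```

Summing the density map is what turns the regression output into the reported head count y.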
The invention can be used in people-flow detection systems for large gathering places, densely visited tourist sites, shopping malls and the like. It predicts the number of people in the current picture from a single picture, and is especially accurate under densely populated conditions.
Compared with the prior art, the invention has the beneficial effects that:
1: The invention can perform more accurate crowd density and count estimation for large-scale crowds;
2: The structure of a classical convolutional neural network is improved: simple convolutional layers are replaced with multi-level attention modules and a custom scale-aware module, and an Adam optimizer optimizes the network's initial weights and thresholds, accelerating convergence, bringing the network close to its optimal parameters, and strengthening the network's extraction of different features;
3: On top of the features extracted by the first ten layers of VGG16, the custom scale-aware module further extracts feature information at different spatial scales, raising the network's attention to dense crowds and overcoming the incompleteness of single-scale feature extraction. Multi-level attention increases the weights of effective features at different scales, weakens the background weight, and improves the regression performance.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (6)

1. A crowd counting method based on multi-level attention scale perception is characterized by comprising the following steps:
s1: acquiring a data set and preprocessing the data set to obtain a density map of a training set and a density map of a testing set;
s2: constructing a backbone of a multi-level attention scale perception neural network;
s3: debugging and training the multi-level attention scale perception neural network and testing the effectiveness of the network based on the training set, the testing set and the backbone of the multi-level attention scale perception neural network to obtain the trained neural network;
s4: and acquiring a camera image, inputting the trained neural network for testing, and acquiring a predicted density map and the predicted number of people of the image.
2. The method for crowd counting based on multi-level attention scale perception according to claim 1, wherein the step S1 comprises:
s11: downloading a public data set, and dividing the public data set into a training set and a testing set;
s12: carrying out data enhancement on the training set and the test set by flipping each image horizontally, doubling the data volume, to obtain the image data of the training set and of the test set respectively;
s13: padding the width and height of the training-set and test-set images to multiples of 16, and adjusting the positions in the head-location maps proportionally, to obtain the location map of the training-set labels and the location map of the test-set labels;
s14: processing the location map of the training-set labels into the density map of the training set using a Gaussian kernel function with a Gaussian kernel size of 15, and likewise processing the location map of the test-set labels into the density map of the test set.
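The density-map construction of step S14 can be sketched as follows. This is a hedged illustration: the claim fixes the Gaussian kernel size at 15, so the filter below is truncated to a 15×15 window; the sigma value and the helper name `locations_to_density` are assumptions, not part of the claim.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def locations_to_density(points, height, width, sigma=4.0):
    """Place a unit impulse at each annotated head position, then blur it.

    points: iterable of (row, col) head coordinates from the location map.
    Returns a density map whose integral equals the number of heads.
    """
    density = np.zeros((height, width), dtype=np.float64)
    for r, c in points:
        if 0 <= r < height and 0 <= c < width:
            density[int(r), int(c)] += 1.0
    # truncate=7/sigma makes the effective window 15x15 (radius 7),
    # matching the kernel size of 15 specified in step S14
    return gaussian_filter(density, sigma=sigma, truncate=7.0 / sigma)

density = locations_to_density([(10, 10), (30, 40), (50, 20)], 64, 64)
```

Because Gaussian blurring preserves total mass, summing the density map recovers the annotated head count, which is what makes density-map regression equivalent to counting.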
3. The method for crowd counting based on multi-level attention scale perception according to claim 1, wherein the step S2 comprises:
s21: designing the structure of the feature-extraction encoder: the first ten layers of VGG16 are taken as the feature extraction layers, each a Conv2d convolution with kernel = 3 followed by a ReLU activation function; the layer widths are 64, 64, 128, 128, maxpooling (kernel = 2), 256, 256, maxpooling (kernel = 2), 512, 512, 512; depth features are extracted with this encoder structure, and the VGG16 pre-training parameters are loaded;
s22: designing the decoder that regresses the crowd density map and the number of people.
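A hedged PyTorch sketch of the encoder in step S21: each entry in the layer list is a 3×3 convolution followed by ReLU, with 2×2 max pooling where the claim lists "maxpooling (kernel = 2)". The layer list copies the claim as written; loading of the official VGG16 pre-trained weights is omitted here.

```python
import torch
import torch.nn as nn

def make_encoder():
    # layer widths as listed in step S21; "M" marks maxpooling (kernel = 2)
    cfg = [64, 64, 128, 128, "M", 256, 256, "M", 512, 512, 512]
    layers, in_ch = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)

encoder = make_encoder()
with torch.no_grad():
    out = encoder(torch.zeros(1, 3, 64, 64))  # two maxpoolings: 64 -> 16
```

With two pooling layers the encoder downsamples by a factor of 4 and outputs 512-channel feature maps, matching the 512 input channels of the decoder's first scale-aware module in step S221.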
4. The method according to claim 3, wherein the step S22 comprises:
s221: the back-end backbone network is SA-UGA-SA-UPA-SA-USA: a scale-aware module SA with kernel = 3, 512 input channels and 128 output channels, followed by a ReLU activation function; a custom channel attention upsampling module UGA; a two-dimensional convolution with kernel = 3, 128 input channels and 64 output channels, followed by a ReLU activation function; a custom spatial attention upsampling module USA; a two-dimensional convolution with kernel = 3, 64 input channels and 16 output channels, followed by a ReLU activation function; a custom pixel attention upsampling module UPA; the last layer is a full convolution with 16 input channels, 1 output channel and kernel = 1, followed by a ReLU activation function; a residual structure is added between the SA modules, and the prediction density map is finally output;
s222: constructing a pixel attention module: the input image in undergoes a two-dimensional convolution whose input channels equal its output channels with kernel = 1, followed by a sigmoid function, giving out; the final output is the element-wise product of in and out, plus in; in this way a weight parameter is added for each pixel;
s223: constructing the custom multi-level attention modules: the channel attention upsampling module (UGA) performs twofold upsampling, takes the sum of adaptive average pooling and adaptive max pooling, passes it through a Sigmoid function, and applies the result to the twofold-upsampled feature map to give the module's output feature map; the spatial attention upsampling module (USA) performs twofold upsampling, concatenates the channel-wise average and maximum of the feature map along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3 followed by a Sigmoid function, and applies the result to the twofold-upsampled feature map to give the module's output feature map; the pixel attention upsampling module (UPA) performs twofold upsampling followed by a two-dimensional convolution with kernel 3, then a two-dimensional convolution with kernel 1 whose Sigmoid output provides per-pixel weights; the weighted features pass through convolutions with kernel 1 and kernel 3 to give the module's output feature map.
S224: constructing the custom scale-aware module SA: the input is x, and its channel count c is recorded; the input is fed into four parallel branches to extract features: the first branch uses a convolution with kernel size 3 and dilation rate 1, giving f1; the second branch uses a different dilated convolution, giving f2; the third branch uses a different dilated convolution, giving f3; the fourth branch uses a different dilated convolution, giving f4. f1 is concatenated with f2 from the next branch, a two-dimensional convolution restores the channel count to c, and a pixel attention module gives y1; y1 is concatenated with f3, a two-dimensional convolution restores the channel count to c, and a pixel attention module gives y2; y2 is concatenated with f4, a two-dimensional convolution restores the channel count to c, and a pixel attention module gives y3; y1, y2 and y3 are concatenated along the channel dimension, and a two-dimensional convolution reduces the channel count from 3c to c, giving the output y.
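The two custom modules of steps S222 and S224 can be sketched together in PyTorch. This is a hedged illustration: the class names, the specific dilation rates (1, 2, 3, 4), and the use of 1×1 convolutions for the channel-restoring steps are assumptions; the claim only specifies that each branch uses a different dilated convolution and that the channel count returns to c after each concatenation.

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """Step S222: per-pixel gate via a 1x1 convolution with equal
    input/output channels, a sigmoid, and an identity added back."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x)) + x

class ScaleAware(nn.Module):
    """Step S224 sketch: four parallel 3x3 dilated-convolution branches
    fused progressively through pixel attention modules."""
    def __init__(self, c_in, c_out):
        super().__init__()
        # assumed dilation rates 1..4; padding=d keeps spatial size
        self.branches = nn.ModuleList(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 3, 4))
        # 2-D convolutions restoring the channel count after each concat
        self.reduce = nn.ModuleList(
            nn.Conv2d(2 * c_out, c_out, kernel_size=1) for _ in range(3))
        self.pa = nn.ModuleList(PixelAttention(c_out) for _ in range(3))
        self.fuse = nn.Conv2d(3 * c_out, c_out, kernel_size=1)  # 3c -> c

    def forward(self, x):
        f1, f2, f3, f4 = (b(x) for b in self.branches)
        y1 = self.pa[0](self.reduce[0](torch.cat([f1, f2], dim=1)))
        y2 = self.pa[1](self.reduce[1](torch.cat([y1, f3], dim=1)))
        y3 = self.pa[2](self.reduce[2](torch.cat([y2, f4], dim=1)))
        return self.fuse(torch.cat([y1, y2, y3], dim=1))

torch.manual_seed(0)
sa = ScaleAware(8, 8)
with torch.no_grad():
    y = sa(torch.randn(1, 8, 16, 16))
```

Chaining the branches through pixel attention lets features from smaller receptive fields gate those from larger ones, which is how the module mixes scale information rather than simply concatenating the four branches.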
5. The method for crowd counting based on multi-level attention scale perception according to claim 1, wherein the step S3 comprises:
s31: setting the loss function and parameters: the loss function uses MSE (mean squared error) with an Adam optimizer; the batch size is set to 1, the learning rate to 0.00001, and the number of epochs to 1000;
s32: inputting the processed images and their Gaussian density maps into the neural network for training;
s33: loading the trained network parameters, testing with the evaluation functions MAE and MSE on the test set, and estimating the network performance.
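The training setup of step S31 can be sketched as follows: MSE loss between the predicted and ground-truth density maps, an Adam optimizer with learning rate 1e-5, and a batch size of 1. The tiny stand-in model is illustrative only; the real network is the encoder/decoder of claims 3 and 4, and the loop runs three iterations here rather than the 1000 epochs the claim specifies.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # stand-in for the full network
criterion = nn.MSELoss()                           # MSE loss, per step S31
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

image = torch.rand(1, 3, 32, 32)    # batch size 1, per step S31
target = torch.rand(1, 1, 32, 32)   # ground-truth Gaussian density map

losses = []
for _ in range(3):                  # the claim trains for 1000 epochs
    optimizer.zero_grad()
    loss = criterion(model(image), target)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

The MAE and MSE evaluation of step S33 would then compare the summed predicted density map against the annotated count for each test image.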
6. The method for crowd counting based on multi-level attention scale perception according to claim 1, wherein the step S4 comprises:
s41: processing the picture acquired by the camera so that it is smaller than 768 × 1024 pixels, obtaining a processed picture;
s42: inputting the processed picture into the trained neural network to obtain the corresponding predicted density map and the predicted number of people y.
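The inference of claim 6 reduces to two small operations: capping the input image size (step S41) and summing the predicted density map, which integrates to the head count (step S42). The sketch below uses a synthetic density map and assumed helper names; the actual density map would come from the trained network.

```python
import numpy as np

def cap_size(h, w, max_h=768, max_w=1024):
    """Step S41: scale factor so the image fits within max_h x max_w."""
    scale = min(1.0, max_h / h, max_w / w)
    return int(h * scale), int(w * scale)

def predicted_count(density_map):
    """Step S42: the density map integrates to the estimated head count."""
    return float(np.sum(density_map))

new_size = cap_size(1536, 2048)                       # camera frame too large
density_map = np.full((96, 128), 42.0 / (96 * 128))   # synthetic map summing to 42
count = predicted_count(density_map)
```

In practice the padded/resized image would also be adjusted to a multiple of 16 on each side, as in the preprocessing of step S13, so the decoder's upsampling stages align with the input resolution.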
CN202110605990.1A 2021-05-31 2021-05-31 Multistage attention scale perception crowd counting method Active CN113283356B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110605990.1A CN113283356B (en) 2021-05-31 2021-05-31 Multistage attention scale perception crowd counting method


Publications (2)

Publication Number Publication Date
CN113283356A (en) 2021-08-20
CN113283356B CN113283356B (en) 2024-04-05

Family

ID=77282919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110605990.1A Active CN113283356B (en) 2021-05-31 2021-05-31 Multistage attention scale perception crowd counting method

Country Status (1)

Country Link
CN (1) CN113283356B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions
WO2020169043A1 (en) * 2019-02-21 2020-08-27 苏州大学 Dense crowd counting method, apparatus and device, and storage medium
CN112132023A (en) * 2020-09-22 2020-12-25 上海应用技术大学 Crowd counting method based on multi-scale context enhanced network
CN112597964A (en) * 2020-12-30 2021-04-02 上海应用技术大学 Method for counting layered multi-scale crowd
CN112668532A (en) * 2021-01-05 2021-04-16 重庆大学 Crowd counting method based on multi-stage mixed attention network


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
马骞 (Ma Qian): "Research on crowd density estimation algorithms based on a channel-domain attention mechanism", Electronic Design Engineering, no. 15 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880588A (en) * 2021-09-13 2023-03-31 国家电网有限公司 Two-stage unmanned aerial vehicle detection method combined with time domain
CN114399728A (en) * 2021-12-17 2022-04-26 燕山大学 Method for counting crowds in foggy day scene
CN114399728B (en) * 2021-12-17 2023-12-05 燕山大学 Foggy scene crowd counting method
CN114511636A (en) * 2022-04-20 2022-05-17 科大天工智能装备技术(天津)有限公司 Fruit counting method and system based on double-filtering attention module
CN117253184A (en) * 2023-08-25 2023-12-19 燕山大学 Foggy day image crowd counting method guided by foggy priori frequency domain attention characterization
CN117253184B (en) * 2023-08-25 2024-05-17 燕山大学 Foggy day image crowd counting method guided by foggy priori frequency domain attention characterization



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant