CN113283356B - Multistage attention scale perception crowd counting method - Google Patents
- Publication number
- CN113283356B CN113283356B CN202110605990.1A CN202110605990A CN113283356B CN 113283356 B CN113283356 B CN 113283356B CN 202110605990 A CN202110605990 A CN 202110605990A CN 113283356 B CN113283356 B CN 113283356B
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
Abstract
The invention provides a multi-level attention scale-aware crowd counting method, belonging to the application of deep learning in computer vision. The method comprises the following specific steps: S1: acquiring a data set; S2: constructing a multi-level attention scale-aware neural network; S3: debugging, training and testing the multi-level attention scale-aware neural network; S4: acquiring a camera image, inputting it into the trained neural network, and obtaining the predicted density map and predicted number of people for the image. The method is applicable to crowd-count detection in large-scale scenes and effectively improves the accuracy of the detection results.
Description
Technical Field
The invention relates to a multi-level attention scale perception crowd counting method.
Background
With the acceleration of urbanization and the rapid development of the urban economy, crowd-gathering scenes and tourist numbers are increasing, accompanied by potential safety hazards. A crowd counting method that predicts the number of people and gives early warning of highly crowded scenes therefore allows the responsible personnel to issue warnings before an emergency and make decisions after one, safeguarding people's lives and property and preventing dangerous incidents.
Currently, existing crowd counting approaches fall mainly into two categories: 1) traditional methods, such as support vector machines and decision trees; 2) deep-learning methods, such as the multi-column and multi-channel networks MCNN and CSRNet. Both have certain limitations: the traditional methods of category 1) are complex and imprecise, while the existing neural networks of category 2) still suffer from comparatively low accuracy.
Disclosure of Invention
The invention aims to provide a multi-level attention scale perception crowd counting method.
In order to solve the above problems, the present invention provides a multi-level attention scale sensing crowd counting method, comprising:
s1: acquiring a data set and preprocessing to obtain a density map of a training set and a density map of a test set;
s2: constructing a backbone of a multi-level attention scale sensing neural network;
s3: debugging and training the multi-level attention scale-aware neural network, and testing the network effectiveness based on the training set, the test set and the backbone of the multi-level attention scale-aware neural network, to obtain a trained neural network;
s4: and acquiring a camera image, inputting a trained neural network for testing, and obtaining a predicted density map and the number of predicted persons of the image.
Further, the step S1 includes:
s11: downloading a public data set, and dividing the public data set into a training set and a testing set;
s12: performing data enhancement on the training set and the test set: each image is flipped horizontally, doubling the amount of data, to obtain the image data of the training set and of the test set respectively;
s13: padding the width and height of the training-set and test-set images up to multiples of 16 pixels, and adjusting the head positions in the positioning maps proportionally, to obtain the label positioning map of the training set and the label positioning map of the test set;
s14: processing the label positioning map of the training set into the density map of the training set with a Gaussian kernel function of kernel size 15, and likewise processing the label positioning map of the test set into the density map of the test set.
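As a sketch of s14, the positioning map (one impulse per annotated head) can be convolved with a normalized Gaussian kernel of size 15 so that the density map integrates to the head count. The choice sigma = 4 below is an assumption; the text fixes only the kernel size.

```python
import numpy as np

def gaussian_kernel(size=15, sigma=4.0):
    """Normalized 2-D Gaussian kernel (sums to 1, so each head adds 1 person)."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def density_map(shape, points, size=15, sigma=4.0):
    """Place one Gaussian blob per annotated head location (row, col)."""
    h, w = shape
    dm = np.zeros((h, w), dtype=np.float64)
    k = gaussian_kernel(size, sigma)
    r = size // 2
    for (y, x) in points:
        y, x = int(y), int(x)
        # clip the kernel at the image border so indexing stays in-frame
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        x0, x1 = max(0, x - r), min(w, x + r + 1)
        ky0, kx0 = y0 - (y - r), x0 - (x - r)
        dm[y0:y1, x0:x1] += k[ky0:ky0 + (y1 - y0), kx0:kx0 + (x1 - x0)]
    return dm
```

For heads well inside the frame, the integral of the map equals the number of annotated points, which is what makes density-map regression equivalent to counting.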
Further, the step S2 includes:
s21: designing the structure of the feature-extraction encoder: the first ten layers of VGG16 are taken as feature extraction layers (kernel = 3, Conv2d convolutions, each convolution layer followed by a ReLU activation), with channel widths 64, 64, 128, max-pooling (kernel = 2), 256, 256, 256, max-pooling (kernel = 2), 512, 512, 512; deep features are extracted with this encoder structure, and the VGG16 pre-trained parameters are loaded;
s22: and designing a regression population density map and the number of people of the decoder.
Further, the step S22 includes:
s221: the back-end backbone network is SA-UGA-SA-UPA-SA-USA: a scale-aware module SA with kernel 3, 512 input channels and 128 output channels, followed by a ReLU activation; a custom channel attention up-sampling module UGA; a two-dimensional convolution with kernel 3, 128 input channels and 64 output channels, followed by a ReLU activation; a custom spatial attention up-sampling module USA; a two-dimensional convolution with kernel 3, 64 input channels and 16 output channels, followed by a ReLU activation; a custom pixel attention up-sampling module UPA; the last layer is a fully convolutional layer with 16 input channels, 1 output channel and kernel 1, followed by a ReLU activation; in addition, residual connections are added between the SA modules, and the prediction density map is finally output;
s222: building a pixel attention module: the input feature map in is passed through a two-dimensional convolution whose input and output channel counts are equal and whose kernel is 1, followed by a sigmoid function, giving out; the final output is the point-wise product of in and out added back to in. In this way a weight parameter is learned for every pixel, which improves accuracy;
s223: constructing the custom multi-level attention modules: the channel attention up-sampling module (UGA) performs double up-sampling, takes the sum of adaptive average pooling and adaptive max pooling, and combines this, after a sigmoid function, with the doubly up-sampled feature map to give the module's output feature map; the spatial attention up-sampling module (USA) performs double up-sampling, concatenates the channel-wise average and maximum feature maps along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3, and combines this, after a sigmoid function, with the doubly up-sampled feature map to give the module's output feature map; the pixel attention up-sampling module (UPA) performs double up-sampling and a two-dimensional convolution with kernel 3, then a two-dimensional convolution with kernel 1 whose sigmoid output weights the kernel-1 convolution output, followed by a final two-dimensional convolution with kernel 3, giving the module's output feature map.
S224: constructing the custom scale-aware module SA: the input x, with c channels, is copied and fed into four parallel branches for feature extraction, each with convolution kernel size 3: the first branch uses dilation rate 1, giving f1; the second, third and fourth branches are set with different dilation rates, giving f2, f3 and f4. f1 is passed through a pixel attention module and added to f2; a two-dimensional convolution restores c channels and a pixel attention module yields y1. f3 is combined with f1 in the same way, convolved back to c channels and passed through a pixel attention module to yield y2, and f4 with f1 to yield y3. Finally, y1, y2 and y3 are concatenated along the channel dimension and a two-dimensional convolution reduces the 3c channels back to c, giving the output y.
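The pixel attention module of s222 reduces, in numpy terms, to a per-pixel gated residual. The sketch below is a minimal illustration under that reading: the 1 × 1 convolution is written as a channel-mixing matrix, with weights supplied by the caller rather than learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pixel_attention(x, weight, bias):
    """Pixel attention as described in s222: a 1x1 convolution with
    in_channels == out_channels followed by a sigmoid gives a per-pixel
    gate a; the module returns x * a + x, a residual re-weighting of
    every pixel.
    x: (C, H, W) feature map; weight: (C, C) 1x1-conv kernel; bias: (C,)."""
    a = sigmoid(np.einsum('oc,chw->ohw', weight, x) + bias[:, None, None])
    return x * a + x
```

With zero weights the gate is sigmoid(0) = 0.5 everywhere, so the module scales its input by 1.5; a trained gate instead emphasizes head pixels and suppresses background.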
Further, the step S3 includes:
s31: loss function and parameter settings: the loss function is the MSE (mean squared error); the Adam optimizer is used, the batch size is set to 1, the learning rate to 0.00001, and the number of epochs to 1000;
s32: inputting the preprocessed images and their Gaussian density maps into the neural network for training;
s33: loading the trained network parameters, measuring the evaluation functions MAE and MSE on the test set, and estimating the network performance from them.
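For s33, crowd counting benchmarks conventionally score per-image counts, the predicted count being the integral of the predicted density map. A minimal sketch of the MAE and MSE evaluation functions (reporting the root of the mean squared error as MSE, the usual convention in this literature) is:

```python
import numpy as np

def count_errors(pred_maps, gt_counts):
    """MAE and (root-)MSE over per-image counts: the predicted count is
    the sum of the predicted density map, as is standard in crowd counting."""
    pred = np.array([m.sum() for m in pred_maps], dtype=np.float64)
    gt = np.asarray(gt_counts, dtype=np.float64)
    mae = np.abs(pred - gt).mean()
    mse = np.sqrt(((pred - gt) ** 2).mean())  # reported as "MSE" in the text
    return mae, mse
```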
Further, the step S4 includes:
s41: scaling the picture acquired by the camera so that it is no larger than 768 × 1024 pixels, to obtain a processed picture;
s42: inputting the processed picture into the trained neural network to obtain the corresponding predicted density map and predicted number of people y.
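Combining the size constraints of s13 and s41, the input-shape handling can be sketched as follows. The choice to downscale uniformly before padding is an assumption; the text fixes only the 768 × 1024 cap and the multiple-of-16 requirement.

```python
def pad_to_16(n):
    """Smallest multiple of 16 that is >= n (width/height are padded, s13)."""
    return ((n + 15) // 16) * 16

def inference_size(h, w, max_h=768, max_w=1024):
    """Scale the camera frame down so it fits inside max_h x max_w (s41),
    then pad both sides up to multiples of 16 for the network."""
    scale = min(1.0, max_h / h, max_w / w)
    h, w = int(h * scale), int(w * scale)
    return pad_to_16(h), pad_to_16(w)
```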
Compared with the prior art, the invention proceeds as follows. S1: acquiring a data set; S2: constructing a multi-level attention scale-aware neural network; S3: debugging, training and testing the multi-level attention scale-aware neural network; S4: acquiring a camera image, inputting it into the trained neural network, and obtaining the predicted density map and predicted number of people for the image. The method is applicable to crowd-count detection in large-scale scenes and effectively improves the accuracy of the detection results.
Compared with the prior art, the invention has the beneficial effects that:
1: the invention can more accurately estimate the crowd density and the crowd quantity for a large-scale crowd;
2: the invention improves the structure of the classical convolutional neural network, replaces a simple convolutional network layer by the multi-level attention module and the custom scale perception module, optimizes the initial weight threshold of the neural network by using the Adam optimizer, accelerates the convergence rate of the network, is close to the optimal parameters of the network, and enhances the extraction of different characteristics by the network;
3: the invention further extracts the characteristic information of different spaces through the custom scale perception module on the basis of extracting the characteristics of the first ten layers of VGG16, improves the attention of the network to the dense crowd, and solves the problem that the single scale characteristic extraction is not comprehensive enough. The weight of the effective features under different scales is increased through multistage attention, the background weight is weakened, and the regression performance is improved.
Drawings
FIG. 1 is a schematic flow chart of a crowd counting method based on multi-level attention scale perception according to one embodiment of the invention;
FIG. 2 is a schematic diagram of a multi-level attention scale aware neural network structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a channel attention upsampling module (UGA) of one embodiment of the present invention;
FIG. 4 is a schematic diagram of a spatial attention upsampling module (USA) of one embodiment of the present invention;
FIG. 5 is a schematic diagram of a pixel attention upsampling module (UPA) of one embodiment of the present invention;
FIG. 6 is a schematic diagram of a scale-aware module structure according to an embodiment of this invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
As shown in fig. 1, the present invention provides a multi-level attention scale sensing crowd counting method, comprising:
s1: acquiring a data set and preprocessing to obtain a density map of a training set and a density map of a test set;
s2: constructing a backbone of a multi-level attention scale sensing neural network;
s3: debugging and training the multi-level attention scale-aware neural network, and testing the network effectiveness based on the training set, the test set and the backbone of the multi-level attention scale-aware neural network, to obtain a trained neural network;
s4: and acquiring a camera image, inputting a trained neural network for testing, and obtaining a predicted density map and the number of predicted persons of the image.
The invention adopts a multi-scale-aware neural network that effectively extracts the features of crowds of different densities. Attention applied at different scales concentrates the network on the densely crowded regions of a single picture, remedying the poverty of single-scale feature extraction and helping the feature maps at multiple levels learn appropriate feature representations.
Specifically, as shown in fig. 1, the present invention provides a crowd counting method based on multi-level attention scale sensing, including:
s1: acquiring a data set and preprocessing;
s2: constructing a multi-level attention scale sensing neural network backbone;
s3: debugging and training a multi-level attention scale aware neural network and testing network effectiveness;
s4: and acquiring a camera image, inputting a trained neural network for testing, and obtaining a predicted density map and the number of predicted persons of the image.
As shown in fig. 2, the present invention provides a crowd counting method based on multi-level attention scale sensing, further describing structural details of a multi-level attention scale sensing neural network, including:
1: the front-end network extracts features. The first ten layers of VGG16 were taken as feature extraction layers, kernel=3, conv2d convolution was used, and each convolution layer was followed by a Relu activation function with layers 64, 64, 128, maxpooling (kernel=2), 256, 256, 256, maxpooling (kernel=2), 512, 512, 512. Depth feature is extracted with this structure.
2: and (5) back-end network design.
3: multistage attention scale sensing neural network
As shown in fig. 3 to 6, the present invention provides a crowd counting method based on multi-level attention scale sensing, further describing a multi-level attention scale sensing module thereof, including:
1: a pixel attention module is constructed. And carrying out two-dimensional convolution on the input image in, wherein an input channel is equal to an output channel, the kernel is 1, then, a sigmoid function is connected to process to obtain out, and finally, the point multiplication of the input image in and the point multiplication of the output image out are added in. In this way, a weight parameter is added to each pixel point, so that the precision is improved.
2: a custom multi-level attention module is constructed. The channel attention upsampling module (UGA) process is double upsampling, the sum of the adaptive average pooling and the adaptive maximum pooling is taken, and the output characteristic diagram of the module is obtained by adding the characteristic diagram after the previous double upsampling to the Sigmoid function. The spatial attention upsampling module (USA) processes are double upsampling, the average value and the maximum value of the channel layer characteristic diagrams are connected according to the channel, convolution with the convolution kernel size of 7 and the expansion rate of 3 is carried out, the characteristic diagrams after the previous double upsampling are added through a Sigmoid function, and the output characteristic diagram of the module is obtained. The pixel attention up-sampling module (UPA) process is two-dimensional convolution with double up-sampling and convolution kernel of 3, two-dimensional convolution with convolution kernel of one, the output of the two-dimensional convolution with the weight after the output of the Sigmoid function and the convolution kernel of one, and the convolution with the two-dimensional convolution with the convolution kernel of one and the convolution kernel of 3 are adopted, so that the output characteristic diagram of the module is obtained.
3: and constructing a custom scale perception module SA. The input is x, the number of x channels c is replicated. Inputting four parallel different modes to extract features, wherein the size of a first layer convolution kernel is 3, and the cavity convolution condition is 1 to obtain f1; setting different cavity convolutions on the second layer to obtain f2; setting different cavity convolutions on a third layer to obtain f3; and setting different cavity convolutions on the fourth layer to obtain f4. And f1 is added with the next layer f2 through a pixel attention module, the number of channels after two-dimensional convolution is c, y1 is obtained through the pixel attention module, then the next layer f3 is connected with the two-dimensional convolution, y2 is obtained through the pixel attention module, then the next layer f4 is connected with the two-dimensional convolution, the number of channels after two-dimensional convolution is c, and y3 is obtained through the pixel attention module. And (3) carrying out two-dimensional convolution on the y1, y2 and y3 after the channel connection to change the channel number from 3c to c, so as to obtain an output y.
As shown in fig. 2, the application of the crowd counting method based on multi-level attention scale sensing is further described. Image data are acquired with a camera and scaled to at most 768 × 1024 pixels; if an image is grayscale, it is converted to a three-channel RGB image; the trained network and its parameters are then loaded, and the picture is input to obtain the predicted number of people.
The invention can be used for people flow detection systems of large-scale gatherings, tourist sites, markets and the like with dense crowd, and can be used for predicting the number of people in the current picture by utilizing a single picture, and particularly, the invention is more accurate under the condition of dense number of people.
In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the identical or similar parts reference may be made between them.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (3)
1. A crowd counting method based on multi-level attention scale perception, comprising:
s1: acquiring a data set and preprocessing to obtain a density map of a training set and a density map of a test set;
s2: constructing a backbone of a multi-level attention scale sensing neural network;
s3: based on the training set, the test set and the backbone of the multi-level attention scale-aware neural network, debugging and training the network and testing its effectiveness, to obtain a trained neural network;
s4: acquiring a camera image and inputting it into the trained neural network for testing, obtaining the predicted density map and the predicted number of people for the image;
the step S1 includes:
s11: downloading a public data set, and dividing the public data set into a training set and a testing set;
s12: performing data enhancement on the training set and the test set by flipping each image left and right, doubling the amount of data, so that the image data of the training set and of the test set are obtained respectively;
s13: padding the width and height of the training-set and test-set images to multiples of 16, and adjusting the positions in the localization maps proportionally, to obtain the localization map of the training-set labels and the localization map of the test-set labels;
s14: processing the localization map of the training-set labels into the density map of the training set using a Gaussian kernel function with a Gaussian kernel size of 15, and likewise processing the localization map of the test-set labels into the density map of the test set;
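Step S14 follows the usual density-map convention: each head annotation becomes an impulse blurred by a Gaussian, so the map's total sum equals the head count. A minimal sketch, assuming sigma = 4 (the claim fixes only the kernel size, 15) and a hypothetical helper name:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_density_map(head_points, height, width, sigma=4.0):
    """Turn (x, y) head annotations into a density map whose sum is the count."""
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in head_points:               # annotations as (column, row)
        if 0 <= int(y) < height and 0 <= int(x) < width:
            density[int(y), int(x)] += 1.0
    # truncate=1.75 with sigma=4 gives kernel radius int(1.75*4 + 0.5) = 7,
    # i.e. a 15x15 kernel as in the claim; blurring preserves the total count
    return gaussian_filter(density, sigma=sigma, truncate=1.75)
```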
the step S2 includes:
s21: designing the structure of the feature-extraction encoder: the first ten convolutional layers of VGG16 serve as the feature extraction layers, with kernel = 3, Conv2d convolutions, and a ReLU activation after each convolutional layer; the channel configuration is 64, 64, max-pooling (kernel = 2), 128, 128, max-pooling (kernel = 2), 256, 256, 256, max-pooling (kernel = 2), 512, 512, 512; this encoder structure extracts deep features, and the VGG16 pre-training parameters are loaded;
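The encoder of S21 can be sketched in PyTorch as below. The channel configuration follows the standard VGG16 front end with ten 3×3 convolutions (as popularized by CSRNet-style counters); loading the ImageNet pre-trained weights is omitted here.

```python
import torch
import torch.nn as nn

def make_encoder():
    # "M" marks a 2x2 max-pool; numbers are output channels of 3x3 convolutions
    cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M", 512, 512, 512]
    layers, in_ch = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]   # ReLU after every convolution
            in_ch = v
    return nn.Sequential(*layers)
```

With three pooling stages, a 768 × 1024 input yields a 512-channel feature map at 1/8 resolution.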
s22: designing the decoder that regresses the crowd density map and the number of people;
the step S22 includes:
s221: the back-end backbone network is SA-UGA-SA-UPA-SA-USA: a scale-aware module SA with kernel 3, 512 input channels and 128 output channels, followed by a ReLU activation; a custom channel attention up-sampling module UGA; a two-dimensional convolution with kernel 3, 128 input channels and 64 output channels, followed by a ReLU activation; a custom pixel attention up-sampling module UPA; a two-dimensional convolution with kernel 3, 64 input channels and 16 output channels, followed by a ReLU activation; a custom spatial attention up-sampling module USA; the last layer is a full convolution with 16 input channels, 1 output channel and kernel 1, followed by a ReLU activation; in addition, residual structures are added between the SA modules, and finally the predicted density map is output;
s222: building the pixel attention module: the input feature map in undergoes a two-dimensional convolution whose input channels equal its output channels, with kernel 1, followed by a sigmoid function, yielding out; the final output is the element-wise product of in and out added to in, which assigns a weight parameter to every pixel;
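A minimal PyTorch sketch of this pixel attention block, under the reading that the output is in · out + in:

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """1x1 convolution + sigmoid produces a per-pixel weight map."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out = self.sigmoid(self.conv(x))   # per-pixel weights in (0, 1)
        return x * out + x                 # re-weighted features plus identity
```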
s223: constructing the custom multi-level attention modules: the channel attention up-sampling module (UGA) performs double up-sampling, takes the sum of adaptive average pooling and adaptive max pooling, and combines the result, via a sigmoid function, with the feature map after the preceding double up-sampling to obtain the module's output feature map; the spatial attention up-sampling module (USA) performs double up-sampling, concatenates the channel-wise average and maximum of the feature map along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3, and combines the result, via a sigmoid function, with the feature map after the preceding double up-sampling to obtain the module's output feature map; the pixel attention up-sampling module (UPA) performs double up-sampling followed by a two-dimensional convolution with kernel 3 and a two-dimensional convolution with kernel 1; the weights output by the sigmoid function are added to the output of the kernel-1 two-dimensional convolution, after which a two-dimensional convolution with kernel 1 and a convolution with kernel 3 yield the module's output feature map;
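As an illustration of the first of these modules, a hedged PyTorch sketch of the channel attention up-sampling module (UGA). The claim does not fix how the sigmoid output is combined with the up-sampled features, so an element-wise multiplication (the usual attention convention) is assumed, and any gating convolutions are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UGA(nn.Module):
    """Channel attention up-sampling: 2x upsample, then gate channels by the
    sigmoid of (adaptive average pool + adaptive max pool)."""
    def forward(self, x):
        up = F.interpolate(x, scale_factor=2, mode="nearest")   # double upsample
        desc = F.adaptive_avg_pool2d(up, 1) + F.adaptive_max_pool2d(up, 1)
        return up * torch.sigmoid(desc)    # broadcast channel-wise re-weighting
```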
s224: constructing the custom scale-aware module SA: the input is x, with channel number c; features are extracted by four parallel branches: the first branch is a convolution with kernel size 3 and dilation rate 1, yielding f1; the second, third and fourth branches use different dilated convolutions, yielding f2, f3 and f4 respectively; f1 is combined with f2, reduced to c channels by a two-dimensional convolution and passed through the pixel attention module to obtain y1; f1 is likewise combined with f3, reduced to c channels and passed through the pixel attention module to obtain y2; f1 is combined with f4, reduced to c channels and passed through the pixel attention module to obtain y3; y1, y2 and y3 are then concatenated along the channel dimension, and a two-dimensional convolution reduces the channel number from 3c to c, giving the output y.
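A sketch of the scale-aware module SA. The dilation rates 2, 3 and 4 for the second to fourth branches are assumptions (the claim fixes only dilation 1 for the first branch), and plain 1×1 convolutions stand in for the claimed pixel-attention fusion:

```python
import torch
import torch.nn as nn

class ScaleAware(nn.Module):
    def __init__(self, c, dilations=(1, 2, 3, 4)):   # dilations 2-4 assumed
        super().__init__()
        # Four parallel 3x3 branches; padding = dilation keeps spatial size
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, kernel_size=3, padding=d, dilation=d)
            for d in dilations)
        # Pairwise fusion 2c -> c (stand-in for the pixel-attention fusion)
        self.fuse_pair = nn.ModuleList(
            nn.Conv2d(2 * c, c, kernel_size=1) for _ in range(3))
        self.fuse_out = nn.Conv2d(3 * c, c, kernel_size=1)   # 3c -> c

    def forward(self, x):
        f1, f2, f3, f4 = (b(x) for b in self.branches)
        y1 = self.fuse_pair[0](torch.cat([f1, f2], dim=1))
        y2 = self.fuse_pair[1](torch.cat([f1, f3], dim=1))
        y3 = self.fuse_pair[2](torch.cat([f1, f4], dim=1))
        return self.fuse_out(torch.cat([y1, y2, y3], dim=1))
```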
2. The method of claim 1, wherein the step S3 comprises:
s31: loss function and parameter settings: the loss function is the mean squared error (MSE); the Adam optimizer is used; the batch size is set to 1, the learning rate to 0.00001, and the number of epochs to 1000;
s32: inputting the Gaussian-processed images into the neural network for training;
s33: loading the trained network parameters, measuring the evaluation metrics MAE and MSE on the test set, and estimating the network performance.
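The MAE and MSE of S33 are, by crowd-counting convention, computed on per-image total counts rather than per-pixel errors; note that the reported "MSE" is conventionally the root of the mean squared count error. A minimal NumPy version with a hypothetical helper name:

```python
import numpy as np

def counting_metrics(pred_counts, true_counts):
    """Per-image count errors: MAE and (root) MSE as reported in counting work."""
    diff = np.asarray(pred_counts, float) - np.asarray(true_counts, float)
    mae = np.mean(np.abs(diff))
    mse = np.sqrt(np.mean(diff ** 2))   # RMSE, commonly labelled "MSE"
    return mae, mse
```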
3. The method of claim 1, wherein the step S4 comprises:
s41: processing the picture acquired by the camera to no more than 768 × 1024 pixels to obtain a processed picture;
s42: inputting the processed picture into the trained neural network to obtain the corresponding predicted density map and the predicted number of people y.
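In S42 the predicted number of people is read off the density map by summation, the standard convention for density-map regression (each head contributes total mass 1 to the map):

```python
import numpy as np

def count_from_density(density_map):
    # The predicted head count is the integral (sum) of the density map
    return float(np.sum(density_map))
```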
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110605990.1A CN113283356B (en) | 2021-05-31 | 2021-05-31 | Multistage attention scale perception crowd counting method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113283356A CN113283356A (en) | 2021-08-20 |
CN113283356B true CN113283356B (en) | 2024-04-05 |
Family
ID=77282919
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115880588A (en) * | 2021-09-13 | 2023-03-31 | 国家电网有限公司 | Two-stage unmanned aerial vehicle detection method combined with time domain |
CN114399728B (en) * | 2021-12-17 | 2023-12-05 | 燕山大学 | Foggy scene crowd counting method |
CN114511636B (en) * | 2022-04-20 | 2022-07-12 | 科大天工智能装备技术(天津)有限公司 | Fruit counting method and system based on double-filtering attention module |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188685A (en) * | 2019-05-30 | 2019-08-30 | 燕山大学 | A kind of object count method and system based on the multiple dimensioned cascade network of double attentions |
WO2020169043A1 (en) * | 2019-02-21 | 2020-08-27 | 苏州大学 | Dense crowd counting method, apparatus and device, and storage medium |
CN112132023A (en) * | 2020-09-22 | 2020-12-25 | 上海应用技术大学 | Crowd counting method based on multi-scale context enhanced network |
CN112597964A (en) * | 2020-12-30 | 2021-04-02 | 上海应用技术大学 | Method for counting layered multi-scale crowd |
CN112668532A (en) * | 2021-01-05 | 2021-04-16 | 重庆大学 | Crowd counting method based on multi-stage mixed attention network |
Non-Patent Citations (1)
Title |
---|
Research on a crowd density estimation algorithm based on a channel-domain attention mechanism; Ma Qian; Electronic Design Engineering (15); full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113283356B (en) | Multistage attention scale perception crowd counting method | |
CN108256562B (en) | Salient target detection method and system based on weak supervision time-space cascade neural network | |
CN112396607B (en) | Deformable convolution fusion enhanced street view image semantic segmentation method | |
CN111738054B (en) | Behavior anomaly detection method based on space-time self-encoder network and space-time CNN | |
CN112861690A (en) | Multi-method fused remote sensing image change detection method and system | |
CN110826428A (en) | Ship detection method in high-speed SAR image | |
CN112597964B (en) | Method for counting layered multi-scale crowd | |
CN110827265B (en) | Image anomaly detection method based on deep learning | |
CN110020658B (en) | Salient object detection method based on multitask deep learning | |
CN111062381B (en) | License plate position detection method based on deep learning | |
CN113888547A (en) | Non-supervision domain self-adaptive remote sensing road semantic segmentation method based on GAN network | |
Cho et al. | Semantic segmentation with low light images by modified CycleGAN-based image enhancement | |
CN116152591B (en) | Model training method, infrared small target detection method and device and electronic equipment | |
CN115035371A (en) | Borehole wall crack identification method based on multi-scale feature fusion neural network | |
CN114663665A (en) | Gradient-based confrontation sample generation method and system | |
CN112132867B (en) | Remote sensing image change detection method and device | |
CN111753714B (en) | Multidirectional natural scene text detection method based on character segmentation | |
CN111626197B (en) | Recognition method based on human behavior recognition network model | |
CN111401209B (en) | Action recognition method based on deep learning | |
CN116403152A (en) | Crowd density estimation method based on spatial context learning network | |
CN115953736A (en) | Crowd density estimation method based on video monitoring and deep neural network | |
CN112215241B (en) | Image feature extraction device based on small sample learning | |
CN115578624A (en) | Agricultural disease and pest model construction method, detection method and device | |
CN113205078B (en) | Crowd counting method based on multi-branch progressive attention-strengthening | |
CN115346115A (en) | Image target detection method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||