CN113283356B - Multistage attention scale perception crowd counting method


Info

Publication number: CN113283356B
Authority: CN (China)
Prior art keywords: convolution, module, kernel, attention, network
Legal status: Active (granted)
Application number: CN202110605990.1A
Other languages: Chinese (zh)
Other versions: CN113283356A
Inventors: 祝鲁宁, 黄良军, 沈世晖, 张亚妮
Assignee (current and original): Shanghai Institute of Technology
Application filed 2021-05-31 by Shanghai Institute of Technology; priority date 2021-05-31


Classifications

    • G06V 40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands (under G06V 40/00: recognition of biometric, human-related or animal-related patterns in image or video data)
    • G06F 18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting (under G06F 18/21: design or setup of recognition systems or techniques; extraction of features in feature space)
    • G06N 3/045 — Combinations of networks (under G06N 3/04: neural network architectures)
    • G06V 20/53 — Recognition of crowd images, e.g. recognition of crowd congestion (under G06V 20/52: surveillance or monitoring of activities)


Abstract

The invention provides a multi-level attention scale-aware crowd counting method, an application of deep learning to computer vision. The method comprises the following steps. S1: acquire a data set; S2: construct a multi-level attention scale-aware neural network; S3: debug, train, and test the network; S4: acquire a camera image and feed it to the trained network to obtain the predicted density map and the predicted number of people in the image. In this way, the method is suitable for crowd counting in large-scale scenes and effectively improves the accuracy of the detection results.

Description

Multistage attention scale perception crowd counting method
Technical Field
The invention relates to a multi-level attention scale-aware crowd counting method.
Background
With the acceleration of urbanization and the rapid development of the urban economy, crowd-gathering scenes and tourist numbers are increasing, accompanied by potential safety hazards. A crowd counting method that predicts crowd size can therefore give early warning of highly crowded scenes, support emergency warning and follow-up decisions by the relevant personnel, protect people's lives and property, and help avoid dangerous events.
Existing crowd counting approaches fall mainly into two categories: 1) traditional methods, such as support vector machines and decision trees; 2) deep-learning methods, such as the MCNN and CSRNet families of networks. Both have limitations: the traditional methods of 1) have high complexity and poor accuracy, while the existing neural networks of 2) still suffer from limited accuracy.
Disclosure of Invention
The invention aims to provide a multi-level attention scale-aware crowd counting method.
In order to solve the above problems, the present invention provides a multi-level attention scale-aware crowd counting method, comprising:
s1: acquiring a data set and preprocessing it to obtain density maps for a training set and a test set;
s2: constructing the backbone of a multi-level attention scale-aware neural network;
s3: debugging, training, and testing the network for effectiveness based on the training set, the test set, and the network backbone, to obtain a trained neural network;
s4: acquiring a camera image, feeding it to the trained neural network, and obtaining the predicted density map and the predicted number of people in the image.
Further, the step S1 includes:
s11: downloading a public data set and dividing it into a training set and a test set;
s12: performing data augmentation on the training set and the test set: each image is flipped horizontally, doubling the amount of data, giving the augmented training-set and test-set images;
s13: padding the width and height of the training-set and test-set images to multiples of 16, and scaling the annotated head positions proportionally, giving the label location maps of the training set and the test set;
s14: processing the label location map of the training set into the training-set density map with a Gaussian kernel function of kernel size 15, and likewise processing the label location map of the test set into the test-set density map, as sketched below.
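The following is a minimal sketch of step S14, assuming head annotations are given as (x, y) pixel coordinates; the function name and the mapping from the 15-pixel kernel to a Gaussian sigma are illustrative assumptions rather than values fixed by the patent:

import numpy as np
from scipy.ndimage import gaussian_filter

def density_map_from_points(points, height, width, kernel_size=15):
    # Place a unit impulse at each annotated head position, then blur with a
    # fixed-size Gaussian (kernel size 15 as in step S14); the resulting map
    # integrates to the ground-truth head count.
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            density[yi, xi] += 1.0
    # Assumed sigma: roughly kernel_size / 4, so the effective support of the
    # filter matches a 15-pixel kernel.
    return gaussian_filter(density, sigma=kernel_size / 4.0)

Because the blur preserves the integral of the impulses, summing a density map recovers the annotated count for that image.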
Further, the step S2 includes:
s21: designing the structure of the feature-extraction encoder: the first ten layers of VGG16 are taken as feature-extraction layers, with kernel = 3, Conv2d convolutions, and a ReLU activation after each convolution layer; the channel configuration is 64, 64, 128, max-pooling (kernel = 2), 256, 256, 256, max-pooling (kernel = 2), 512, 512, 512. Deep features are extracted with this encoder structure, and the VGG16 pre-trained parameters are loaded (see the sketch after step S22);
s22: designing the decoder that regresses the crowd density map and the number of people.
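As a sketch of the S21 encoder, the block below builds the first ten convolutional layers of VGG16 in PyTorch and copies in the pretrained weights. The pooling placement follows the standard VGG16 layout (the patent's abridged channel list is assumed to mean the standard first-ten-layer configuration), and the torchvision API usage is an assumption:

import torch.nn as nn
from torchvision import models

def make_vgg16_frontend():
    # Standard VGG16 front end: ten 3x3 conv layers, each followed by ReLU,
    # with three 2x2 max-pooling layers ('M') interleaved.
    cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512]
    layers, in_ch = [], 3
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    frontend = nn.Sequential(*layers)
    # S21: load the VGG16 pre-trained parameters; the first 23 modules of
    # torchvision's vgg16.features match the structure built above exactly.
    vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    frontend.load_state_dict(vgg.features[:23].state_dict())
    return frontend

An input H x W image then yields a 512-channel feature map at H/8 x W/8, which the decoder of step S22 up-samples while regressing the density map.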
Further, the step S22 includes:
s221: the back-end backbone is SA-UGA-SA-UPA-SA-USA: a scale-aware module SA with kernel 3, 512 input channels and 128 output channels, followed by a ReLU activation; a custom channel attention up-sampling module UGA; a two-dimensional convolution with kernel 3, 128 input channels and 64 output channels, followed by a ReLU activation; a custom spatial attention up-sampling module USA; a two-dimensional convolution with kernel 3, 64 input channels and 16 output channels, followed by a ReLU activation; a custom pixel attention up-sampling module UPA; the last layer is a fully convolutional layer with 16 input channels, 1 output channel, and kernel 1, followed by a ReLU activation. In addition, residual connections are added between the SA modules, and the network finally outputs the predicted density map;
s222: building a pixel attention module: the input feature map in is passed through a two-dimensional convolution whose input channels equal its output channels, with kernel 1, followed by a sigmoid function, giving out; the final output is the element-wise product of in and out added back to in. In this way a weight parameter is learned for each pixel, improving accuracy;
s223: constructing the custom multi-level attention modules: the channel attention up-sampling module (UGA) performs 2x up-sampling, takes the sum of adaptive average pooling and adaptive max pooling, passes it through a Sigmoid function, and combines the resulting channel weights with the up-sampled feature map to obtain the module's output feature map; the spatial attention up-sampling module (USA) performs 2x up-sampling, concatenates the channel-wise mean and maximum feature maps along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3 followed by a Sigmoid function, and combines the resulting spatial weights with the up-sampled feature map to obtain the module's output feature map; the pixel attention up-sampling module (UPA) performs 2x up-sampling followed by a two-dimensional convolution with kernel 3, computes pixel-wise weights with a kernel-1 two-dimensional convolution and a Sigmoid function, combines these weights with the feature map, and applies a final two-dimensional convolution with kernel 3 to obtain the module's output feature map;
s224: constructing the custom scale-aware module SA (see the sketch below). The input is x, whose channel count c is recorded. The input is fed to four parallel branches that extract features in different ways: the first branch uses a convolution of kernel size 3 with dilation rate 1, giving f1; the second, third, and fourth branches use different dilation rates, giving f2, f3, and f4. f1 is added to f2, passed through a two-dimensional convolution bringing the channel count to c, and then through the pixel attention module, giving y1; f1 is combined with f3 and processed the same way, giving y2; and with f4, giving y3. y1, y2, and y3 are concatenated along the channel dimension, and a two-dimensional convolution reduces the channel count from 3c to c, giving the output y.
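Below is a PyTorch sketch of the pixel attention module (S222) and the scale-aware module SA (S224). The patent only says the lower branches use "different" dilation rates, so the concrete rates (1, 2, 3, 4) are assumptions, as are the class and argument names:

import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    # S222: a 1x1 convolution (input channels == output channels) plus a
    # sigmoid yields a per-pixel weight map w; the output is in * w + in.
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        w = torch.sigmoid(self.conv(x))
        return x * w + x

class ScaleAware(nn.Module):
    # S224: four parallel 3x3 branches with different dilation rates produce
    # f1..f4; f1 is fused with each of f2..f4, reduced to c channels, passed
    # through pixel attention, and the three results are concatenated and
    # projected from 3c back to c channels.
    def __init__(self, in_ch, c, rates=(1, 2, 3, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, c, kernel_size=3, padding=r, dilation=r)
            for r in rates)
        self.fuse = nn.ModuleList(
            nn.Conv2d(c, c, kernel_size=3, padding=1) for _ in range(3))
        self.pa = nn.ModuleList(PixelAttention(c) for _ in range(3))
        self.project = nn.Conv2d(3 * c, c, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = [branch(x) for branch in self.branches]        # f1, f2, f3, f4
        ys = [pa(conv(f[0] + fi))                          # (f1 + fi) -> conv -> PA
              for fi, conv, pa in zip(f[1:], self.fuse, self.pa)]
        return self.relu(self.project(torch.cat(ys, dim=1)))  # 3c -> c

In the S221 backbone the first SA instance would be ScaleAware(512, 128), matching the stated 512 input and 128 output channels.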
Further, the step S3 includes:
s31: loss function and parameter settings: the loss function is the mean squared error (MSE); the Adam optimizer is used; the batch size is set to 1, the learning rate to 0.00001, and the number of epochs to 1000;
s32: inputting the preprocessed images, with their Gaussian density maps, into the neural network for training;
s33: loading the trained network parameters and evaluating network performance on the test set with the MAE and MSE metrics (a training sketch follows).
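A minimal training-loop sketch for S31-S32 follows, assuming model is the network from step S2 and train_loader yields (image, ground-truth density map) pairs with batch size 1:

import torch
import torch.nn as nn

def train(model, train_loader, device="cuda", epochs=1000, lr=1e-5):
    # S31: MSE loss, Adam optimizer, learning rate 0.00001, 1000 epochs.
    model = model.to(device).train()
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        running = 0.0
        for image, gt_density in train_loader:
            image, gt_density = image.to(device), gt_density.to(device)
            loss = criterion(model(image), gt_density)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch}: mean loss {running / len(train_loader):.6f}")

For S33, the MAE and MSE metrics compare the summed predicted counts against the ground-truth counts over the test set.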
Further, the step S4 includes:
s41: processing the picture acquired by the camera so that it is no larger than 768 × 1024 pixels, giving the processed picture;
s42: inputting the processed picture into the trained neural network to obtain the corresponding predicted density map and the predicted number of people y, as sketched below.
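A sketch of the S42 inference step, assuming the input tensor has already been resized and padded per S41; the predicted head count y is the integral (sum) of the predicted density map:

import torch

@torch.no_grad()
def predict_count(model, image_tensor):
    # image_tensor: shape 1 x 3 x H x W, preprocessed per S41.
    model.eval()
    density = model(image_tensor)     # predicted density map
    y = float(density.sum())          # predicted number of people
    return density, y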
Compared with the prior art, the invention proceeds as follows. S1: acquire a data set; S2: construct a multi-level attention scale-aware neural network; S3: debug, train, and test the network; S4: acquire a camera image and feed it to the trained network to obtain the predicted density map and the predicted number of people. In this way, the invention is applicable to crowd counting in large-scale scenes and effectively improves the accuracy of the detection results.
Compared with the prior art, the invention has the following beneficial effects:
1: the invention estimates crowd density and crowd size more accurately for large-scale crowds;
2: the invention improves the structure of the classical convolutional neural network, replacing plain convolutional layers with the multi-level attention modules and the custom scale-aware module; the Adam optimizer optimizes the network's initial weights, accelerating convergence toward the optimal parameters and strengthening the network's extraction of diverse features;
3: on top of the features extracted by the first ten layers of VGG16, the custom scale-aware module extracts further feature information from different spatial extents, increases the network's attention to dense crowds, and remedies the incompleteness of single-scale feature extraction. The multi-level attention increases the weight of effective features at different scales, suppresses the background weight, and improves regression performance.
Drawings
FIG. 1 is a schematic flow chart of a crowd counting method based on multi-level attention scale perception according to one embodiment of the invention;
FIG. 2 is a schematic diagram of a multi-level attention scale aware neural network structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a channel attention upsampling module (UGA) of one embodiment of the present invention;
FIG. 4 is a schematic diagram of a spatial attention upsampling module (USA) of one embodiment of the present invention;
FIG. 5 is a schematic diagram of a pixel attention upsampling module (UPA) of one embodiment of the present invention;
FIG. 6 is a schematic diagram of the scale-aware module structure according to one embodiment of the present invention.
Detailed Description
In order that the above objects, features, and advantages of the present invention may be more readily apparent, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in fig. 1, the present invention provides a multi-level attention scale-aware crowd counting method comprising steps S1 to S4 as set out above.
The invention adopts a multi-scale-aware neural network that effectively extracts the features of crowds of different densities, while attention at different scales focuses the network on the densely crowded regions within a single picture. This remedies the limited richness of features extracted at a single scale, and the multi-scale attention strengthens the ability of the multi-level feature maps to learn appropriate feature representations.
As shown in fig. 2, the structural details of the multi-level attention scale-aware neural network are as follows:
1: the front-end network extracts features. The first ten layers of VGG16 are taken as feature-extraction layers, with kernel = 3, Conv2d convolutions, and a ReLU activation after each convolution layer; the channel configuration is 64, 64, 128, max-pooling (kernel = 2), 256, 256, 256, max-pooling (kernel = 2), 512, 512, 512. Deep features are extracted with this structure.
2: the back-end network design, as described in step S22.
3: the complete multi-level attention scale-aware neural network.
As shown in figs. 3 to 6, the multi-level attention and scale-aware modules are described further below (a sketch of the three up-sampling modules follows item 3):
1: building the pixel attention module. The input feature map in is passed through a two-dimensional convolution whose input channels equal its output channels, with kernel 1, then through a sigmoid function to obtain out; the final output is the element-wise product of in and out added back to in. In this way a weight parameter is learned for each pixel, improving accuracy.
2: constructing the custom multi-level attention modules. The channel attention up-sampling module (UGA) performs 2x up-sampling, takes the sum of adaptive average pooling and adaptive max pooling, passes it through a Sigmoid function, and combines the resulting channel weights with the up-sampled feature map to obtain the module's output feature map. The spatial attention up-sampling module (USA) performs 2x up-sampling, concatenates the channel-wise mean and maximum feature maps along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3 followed by a Sigmoid function, and combines the resulting spatial weights with the up-sampled feature map to obtain the module's output feature map. The pixel attention up-sampling module (UPA) performs 2x up-sampling followed by a two-dimensional convolution with kernel 3, computes pixel-wise weights with a kernel-1 two-dimensional convolution and a Sigmoid function, combines these weights with the feature map, and applies a final two-dimensional convolution with kernel 3 to obtain the module's output feature map.
3: constructing the custom scale-aware module SA. The input is x, whose channel count c is recorded. The input is fed to four parallel branches: the first uses a convolution of kernel size 3 with dilation rate 1, giving f1; the second, third, and fourth use different dilation rates, giving f2, f3, and f4. f1 is added to f2, passed through a two-dimensional convolution bringing the channel count to c, and then through the pixel attention module, giving y1; f1 is combined with f3 and processed the same way, giving y2; and with f4, giving y3. y1, y2, and y3 are concatenated along the channel dimension, and a two-dimensional convolution reduces the channel count from 3c to c, giving the output y.
As shown in fig. 2, the application of the method is further described as follows: image data are acquired with a camera and processed to 768 × 1024 pixels; if an image is grayscale, it is converted to a three-channel RGB image; the trained network and its parameters are loaded, and the picture is input to obtain the predicted number of people (a preprocessing sketch follows).
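A preprocessing sketch for this application step, assuming OpenCV for frame handling and ImageNet normalization constants (the patent does not specify the normalization; these values are consistent with VGG16 pretraining):

import cv2
import numpy as np
import torch

def preprocess_frame(frame, max_h=768, max_w=1024):
    # Grayscale camera frames are converted to three-channel images.
    if frame.ndim == 2:
        frame = cv2.cvtColor(frame, cv2.COLOR_GRAY2BGR)
    # Scale the frame down so it fits within 768 x 1024 pixels.
    h, w = frame.shape[:2]
    scale = min(max_h / h, max_w / w, 1.0)
    if scale < 1.0:
        frame = cv2.resize(frame, (int(w * scale), int(h * scale)))
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    # Assumed ImageNet normalization.
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    rgb = (rgb - mean) / std
    return torch.from_numpy(rgb.transpose(2, 0, 1)).unsqueeze(0)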
The invention can be used in people-flow detection systems for densely crowded settings such as large gatherings, tourist sites, and shopping malls: it predicts the number of people in the current scene from a single picture, and it is particularly accurate under dense-crowd conditions.
In the present specification, the embodiments are described progressively, each focusing on its differences from the others; for identical or similar parts, the embodiments may refer to one another.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (3)

1. A multi-level attention scale-aware crowd counting method, comprising:
s1: acquiring a data set and preprocessing it to obtain density maps for a training set and a test set;
s2: constructing the backbone of a multi-level attention scale-aware neural network;
s3: debugging, training, and testing the network for effectiveness based on the training set, the test set, and the network backbone, to obtain a trained neural network;
s4: acquiring a camera image, feeding it to the trained neural network, and obtaining the predicted density map and the predicted number of people in the image;
the step S1 includes:
s11: downloading a public data set and dividing it into a training set and a test set;
s12: performing data augmentation on the training set and the test set: each image is flipped horizontally, doubling the amount of data, giving the augmented training-set and test-set images;
s13: padding the width and height of the training-set and test-set images to multiples of 16, and scaling the annotated head positions proportionally, giving the label location maps of the training set and the test set;
s14: processing the label location map of the training set into the training-set density map with a Gaussian kernel function of kernel size 15, and likewise processing the label location map of the test set into the test-set density map;
the step S2 includes:
s21: designing the structure of the feature-extraction encoder: the first ten layers of VGG16 are taken as feature-extraction layers, with kernel = 3, Conv2d convolutions, and a ReLU activation after each convolution layer; the channel configuration is 64, 64, 128, max-pooling (kernel = 2), 256, 256, 256, max-pooling (kernel = 2), 512, 512, 512; deep features are extracted with this encoder structure, and the VGG16 pre-trained parameters are loaded;
s22: designing the decoder that regresses the crowd density map and the number of people;
the step S22 includes:
s221: the back-end backbone is SA-UGA-SA-UPA-SA-USA: a scale-aware module SA with kernel 3, 512 input channels and 128 output channels, followed by a ReLU activation; a custom channel attention up-sampling module UGA; a two-dimensional convolution with kernel 3, 128 input channels and 64 output channels, followed by a ReLU activation; a custom spatial attention up-sampling module USA; a two-dimensional convolution with kernel 3, 64 input channels and 16 output channels, followed by a ReLU activation; a custom pixel attention up-sampling module UPA; the last layer is a fully convolutional layer with 16 input channels, 1 output channel, and kernel 1, followed by a ReLU activation; in addition, residual connections are added between the SA modules, and the network finally outputs the predicted density map;
s222: building a pixel attention module: the input feature map in is passed through a two-dimensional convolution whose input channels equal its output channels, with kernel 1, followed by a sigmoid function, giving out; the final output is the element-wise product of in and out added back to in, whereby a weight parameter is learned for each pixel;
s223: constructing the custom multi-level attention modules: the channel attention up-sampling module (UGA) performs 2x up-sampling, takes the sum of adaptive average pooling and adaptive max pooling, passes it through a Sigmoid function, and combines the resulting channel weights with the up-sampled feature map to obtain the module's output feature map; the spatial attention up-sampling module (USA) performs 2x up-sampling, concatenates the channel-wise mean and maximum feature maps along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3 followed by a Sigmoid function, and combines the resulting spatial weights with the up-sampled feature map to obtain the module's output feature map; the pixel attention up-sampling module (UPA) performs 2x up-sampling followed by a two-dimensional convolution with kernel 3, computes pixel-wise weights with a kernel-1 two-dimensional convolution and a Sigmoid function, combines these weights with the feature map, and applies a final two-dimensional convolution with kernel 3 to obtain the module's output feature map;
s224: constructing the custom scale-aware module SA; the input is x, whose channel count c is recorded; the input is fed to four parallel branches that extract features in different ways: the first branch uses a convolution of kernel size 3 with dilation rate 1, giving f1; the second, third, and fourth branches use different dilation rates, giving f2, f3, and f4; f1 is added to f2, passed through a two-dimensional convolution bringing the channel count to c, and then through the pixel attention module, giving y1; f1 is combined with f3 and processed the same way, giving y2; and with f4, giving y3; y1, y2, and y3 are concatenated along the channel dimension, and a two-dimensional convolution reduces the channel count from 3c to c, giving the output y.
2. The method of claim 1, wherein the step S3 comprises:
s31: loss function and parameter settings: the loss function is the mean squared error (MSE); the Adam optimizer is used; the batch size is set to 1, the learning rate to 0.00001, and the number of epochs to 1000;
s32: inputting the preprocessed images, with their Gaussian density maps, into the neural network for training;
s33: loading the trained network parameters and evaluating network performance on the test set with the MAE and MSE metrics.
3. The method of claim 1, wherein the step S4 comprises:
s41: processing the picture acquired by the camera so that it is no larger than 768 × 1024 pixels, giving the processed picture;
s42: inputting the processed picture into the trained neural network to obtain the corresponding predicted density map and the predicted number of people y.

Publications (2)

CN113283356A — published 2021-08-20
CN113283356B — granted 2024-04-05






Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant