CN113283356A - Multi-level attention scale perception crowd counting method - Google Patents
Multi-level attention scale perception crowd counting method
- Publication number
- CN113283356A (application number CN202110605990.1A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- module
- kernel
- attention
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
Abstract
The invention provides a multi-level attention scale perception crowd counting method, belonging to the application of deep learning in computer vision. The method comprises the following specific steps: s1: acquiring a data set; s2: constructing a multi-level attention scale perception neural network; s3: debugging, training and testing the multi-level attention scale perception neural network; s4: acquiring a camera image, inputting it into the trained neural network, and obtaining the predicted density map and predicted number of people for the image. In this way, the method is suitable for detecting the number of people in large-scale scenes and effectively improves the accuracy of the detection result.
Description
Technical Field
The invention relates to a multi-level attention scale perception crowd counting method.
Background
With the accelerating pace of urbanization and the rapid development of urban economies, crowd gathering scenes and visitor numbers are increasing, accompanied by potential safety hazards. A crowd counting method that predicts crowd sizes and gives early warning for highly crowded scenes can therefore help relevant personnel issue warnings and make decisions before and after an emergency, safeguarding people's lives and property and avoiding dangerous events.
Existing crowd counting methods are mainly divided into two types: 1) methods based on traditional techniques, such as support vector machines and decision trees; 2) methods based on deep learning, such as convolutional neural networks like MCNN and CSRNet. Both have certain limitations: the traditional methods of type 1) have high complexity and poor precision, while the existing neural networks of type 2) still suffer from low accuracy.
Disclosure of Invention
The invention aims to provide a multi-level attention scale perception crowd counting method.
In order to solve the above problems, the present invention provides a multi-level attention scale perception crowd counting method, comprising:
s1: acquiring a data set and preprocessing the data set to obtain a density map of a training set and a density map of a testing set;
s2: constructing a backbone of a multi-level attention scale perception neural network;
s3: debugging and training the multi-level attention scale perception neural network and testing the effectiveness of the network based on the training set, the testing set and the backbone of the multi-level attention scale perception neural network to obtain the trained neural network;
s4: and acquiring a camera image, inputting the trained neural network for testing, and acquiring a predicted density map and the predicted number of people of the image.
Further, the step S1 includes:
s11: downloading a public data set, and dividing the public data set into a training set and a testing set;
s12: carrying out data enhancement on the training set and the test set by flipping each image left and right, doubling the data volume, to obtain the image data of the training set and of the test set respectively;
s13: padding the width and height of the training-set and test-set images to multiples of 16, and proportionally adjusting the head positions in the positioning maps, to obtain the label positioning maps of the training set and of the test set;
s14: processing the label positioning map of the training set into the density map of the training set using a Gaussian kernel function with kernel size 15, and likewise processing the label positioning map of the test set into the density map of the test set.
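The preprocessing of steps s13 and s14 can be sketched as follows. This is a minimal NumPy/SciPy sketch; the function names and the heuristic sigma = kernel_size / 6 are my assumptions, since the patent only fixes the Gaussian kernel size at 15:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def pad_to_multiple_of_16(img):
    """Pad the height and width up to the next multiple of 16 (step s13)."""
    h, w = img.shape[:2]
    ph = (16 - h % 16) % 16
    pw = (16 - w % 16) % 16
    return np.pad(img, ((0, ph), (0, pw)) + ((0, 0),) * (img.ndim - 2))

def make_density_map(shape, head_points, kernel_size=15):
    """Turn a list of (row, col) head locations into a density map (step s14)."""
    density = np.zeros(shape, dtype=np.float64)
    for r, c in head_points:
        density[int(r), int(c)] += 1.0
    # Fixed Gaussian of the stated kernel size 15; sigma heuristic assumed.
    return gaussian_filter(density, sigma=kernel_size / 6.0, truncate=3.0)

img = np.zeros((100, 150, 3))
padded = pad_to_multiple_of_16(img)        # -> height 112, width 160
d = make_density_map(padded.shape[:2], [(50, 50), (60, 90)])
```

Because Gaussian smoothing preserves total mass, the density map still sums to the number of annotated heads, which is what makes count regression from density maps possible.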
Further, the step S2 includes:
s21: designing the structure of the feature extraction encoder: the first ten convolution layers of VGG16 are used as the feature extraction layers, all with kernel = 3, using Conv2d convolutions, with a Relu activation function after each convolution layer; the channel configuration is 64, 64, maxpooling (kernel = 2), 128, 128, maxpooling (kernel = 2), 256, 256, 256, maxpooling (kernel = 2), 512, 512, 512. This encoder structure extracts the depth features, and the VGG16 pre-training parameters are loaded;
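The Conv2d/Relu vocabulary suggests a PyTorch implementation; the encoder of step s21 could then be sketched as below. The layer list follows the standard VGG16 configuration (ten 3x3 convolutions, three 2x2 poolings), and loading the pre-trained VGG16 parameters is omitted:

```python
import torch
import torch.nn as nn

def make_vgg16_frontend():
    """First ten convolution layers of VGG16 (all 3x3, ReLU after each) with
    the first three 2x2 max-pooling layers, as described in step s21."""
    cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512]
    layers, in_ch = [], 3
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)

# 112 and 160 are multiples of 16 (step s13); three poolings give 1/8 size
feat = make_vgg16_frontend()(torch.zeros(1, 3, 112, 160))
```

The 1/8-resolution, 512-channel output is the depth feature map the decoder then upsamples back with the attention modules.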
s22: designing the decoder that regresses the crowd density map and the number of people.
Further, the step S22 includes:
s221: the back-end backbone network is SA-UGA-SA-USA-SA-UPA: a scale perception module SA with kernel = 3, 512 input channels and 128 output channels, followed by a Relu activation function; a custom channel attention upsampling module UGA; a two-dimensional convolution with kernel = 3, 128 input channels and 64 output channels, followed by a Relu activation function; a custom spatial attention upsampling module USA; a two-dimensional convolution with kernel = 3, 64 input channels and 16 output channels, followed by a Relu activation function; a custom pixel attention upsampling module UPA; the last layer is a full convolution with 16 input channels, 1 output channel and kernel = 1, followed by a Relu activation function; a residual structure is added between the SA modules, and the network finally outputs the predicted density map;
s222: constructing a pixel attention module: a two-dimensional convolution with kernel = 1 and input channels equal to output channels is applied to the input image in, followed by a sigmoid function, giving out; the final output is the element-wise product of in and out, plus in;
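A minimal PyTorch sketch of the pixel attention module of step s222; the class name is mine, and the residual "plus in" is taken literally from the text:

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """Pixel attention (step s222): a 1x1 convolution keeping the channel
    count, a sigmoid giving a per-pixel weight map, output = in * out + in."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x)) + x

y = PixelAttention(16)(torch.zeros(2, 16, 8, 8))  # shape is preserved
```

Each pixel thus receives its own learned weight, which is the precision mechanism the detailed description attributes to this module.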
s223: constructing custom multi-level attention modules: the channel attention upsampling module (UGA) performs twofold upsampling, takes the sum of adaptive average pooling and adaptive max pooling, passes the sum through a Sigmoid function, and combines the result with the twofold-upsampled feature map to give the output feature map of the module; the spatial attention upsampling module (USA) performs twofold upsampling, concatenates the channel-wise mean and maximum of the feature map along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3 followed by a Sigmoid function, and combines the result with the twofold-upsampled feature map to give the output feature map of the module; the pixel attention upsampling module (UPA) performs twofold upsampling followed by a convolution with kernel size 3 and a convolution with kernel size 1, weights the output of the kernel-1 convolution by its Sigmoid output, and applies a further kernel-1 and kernel-3 convolution to give the output feature map of the module.
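One possible PyTorch reading of the UGA module of step s223 is sketched below. The text leaves open whether the Sigmoid-weighted channel statistics are combined with the upsampled map additively or multiplicatively; a multiplicative gate with a residual add is assumed here, and the class is parameter-free under that reading:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UGA(nn.Module):
    """Channel attention upsampling (step s223), hedged reading: 2x upsample,
    per-channel weight = Sigmoid(adaptive avg pool + adaptive max pool),
    then gate the upsampled map and keep a residual connection."""
    def forward(self, x):
        up = F.interpolate(x, scale_factor=2, mode='nearest')
        w = torch.sigmoid(F.adaptive_avg_pool2d(up, 1) +
                          F.adaptive_max_pool2d(up, 1))
        return up * w + up   # broadcast the (N, C, 1, 1) weight over H x W

out = UGA()(torch.zeros(1, 8, 8, 8))   # -> (1, 8, 16, 16)
```

USA and UPA follow the same upsample-then-attend pattern, with the attention computed over the spatial and pixel dimensions respectively.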
S224: constructing the custom scale perception module SA. The input is x with c channels. Features are extracted by four parallel branches: the first branch is a convolution with kernel size 3 and dilation rate 1, giving f1; the second, third and fourth branches use the same kernel with different dilation rates, giving f2, f3 and f4. f1 is combined with f2, reduced to c channels by a two-dimensional convolution and passed through a pixel attention module to give y1; y1 is concatenated with f3, reduced to c channels and passed through a pixel attention module to give y2; y2 is concatenated with f4, reduced to c channels and passed through a pixel attention module to give y3. y1, y2 and y3 are concatenated along the channel dimension, and a final two-dimensional convolution reduces the channel count from 3c to c, giving the output y.
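The SA module of step S224 can be sketched as follows. The patent fixes only the first branch's dilation rate at 1; rates 2, 3 and 4 for the remaining branches are my assumption, as is reading the progressive fusion as channel concatenation followed by a 3x3 reduction convolution:

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):   # as in step s222
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, kernel_size=1)
    def forward(self, x):
        return x * torch.sigmoid(self.conv(x)) + x

class ScaleAware(nn.Module):
    """Scale perception module SA (step S224), sketched under assumptions:
    four parallel 3x3 dilated convolutions (rates 1..4 assumed), progressive
    fusion of adjacent branches through pixel attention, then a 3c -> c
    reduction over the concatenated y1, y2, y3."""
    def __init__(self, c):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, 3, padding=d, dilation=d) for d in (1, 2, 3, 4))
        self.fuse = nn.ModuleList(nn.Conv2d(2 * c, c, 3, padding=1)
                                  for _ in range(3))
        self.pa = nn.ModuleList(PixelAttention(c) for _ in range(3))
        self.out = nn.Conv2d(3 * c, c, kernel_size=1)

    def forward(self, x):
        f = [b(x) for b in self.branches]       # f1..f4
        ys, prev = [], f[0]
        for i in range(3):                      # fuse prev with f2, f3, f4
            prev = self.pa[i](self.fuse[i](torch.cat([prev, f[i + 1]], dim=1)))
            ys.append(prev)                     # y1, y2, y3
        return self.out(torch.cat(ys, dim=1))   # 3c -> c

z = ScaleAware(8)(torch.zeros(1, 8, 16, 16))   # channel count preserved
```

The differing dilation rates give the branches different receptive fields, which is how the module perceives heads at multiple scales.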
Further, the step S3 includes:
s31: setting the loss function and parameters: the loss function is the MSE (mean squared error), the Adam optimizer is used, the batch size is set to 1, the learning rate to 0.00001, and the number of epochs to 1000;
s32: inputting the preprocessed images and their Gaussian density maps into the neural network for training;
s33: loading the trained network parameters, evaluating the MAE and MSE metrics on the test set, and estimating the network performance.
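Steps s31 to s33 amount to a standard density-map regression loop, sketched here with a toy one-layer model standing in for the full network (batch size 1, Adam, learning rate 1e-5 as stated; 3 epochs instead of 1000):

```python
import torch
import torch.nn as nn

# Toy stand-in for the full network (the real model is the VGG16 front end
# plus the SA/UGA/USA/UPA decoder of step s2).
torch.manual_seed(0)
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)   # s31
criterion = nn.MSELoss()                                    # s31: MSE loss

img = torch.rand(1, 3, 32, 32)              # batch size 1
gt_density = torch.rand(1, 1, 32, 32) * 0.01

for epoch in range(3):
    optimizer.zero_grad()
    loss = criterion(model(img), gt_density)  # pixel-wise MSE on density maps
    loss.backward()
    optimizer.step()

# s33: MAE / MSE on predicted vs. ground-truth counts (count = density sum)
with torch.no_grad():
    err = (model(img).sum() - gt_density.sum()).item()
mae, mse = abs(err), err ** 2
```

On a real test set the MAE/MSE would be averaged over all images rather than computed on a single frame.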
Further, the step S4 includes:
s41: scaling the picture acquired by the camera to at most 768×1024 pixels to obtain a processed picture;
s42: inputting the processed picture into the trained neural network to obtain the corresponding predicted density map and the predicted number of people y.
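The count in step s42 is simply the integral of the predicted density map. A small sketch of steps s41 and s42 (whether 768 bounds the height and 1024 the width is my assumption; the function names are illustrative):

```python
import numpy as np

def limit_size(height, width, max_h=768, max_w=1024):
    """Step s41: new dimensions bringing an image within 768x1024 pixels
    while preserving the aspect ratio."""
    s = min(1.0, max_h / height, max_w / width)
    return int(round(height * s)), int(round(width * s))

def predicted_count(density_map):
    """Step s42: the predicted number of people is the sum of the predicted
    density map, rounded to a whole number."""
    return int(round(float(np.asarray(density_map).sum())))

size = limit_size(1536, 2048)                       # -> (768, 1024)
count = predicted_count(np.full((10, 10), 0.05))    # -> 5
```

For example, a 1536×2048 frame is halved to 768×1024 before being fed to the network, and summing the returned density map yields the head count.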
Compared with the prior art, the method proceeds through S1: acquiring a data set; S2: constructing a multi-level attention scale perception neural network; S3: debugging, training and testing the multi-level attention scale perception neural network; S4: acquiring a camera image, inputting it into the trained neural network, and obtaining the predicted density map and predicted number of people for the image. In this way, the method is suitable for detecting the number of people in large-scale scenes and effectively improves the accuracy of the detection result.
Compared with the prior art, the invention has the beneficial effects that:
1: the invention can perform more accurate crowd density and count estimation for large-scale crowds;
2: the structure of a classical convolutional neural network is improved: simple convolutional layers are replaced by multi-level attention modules and a custom scale perception module, and an Adam optimizer is used to optimize the initial weights and thresholds of the neural network, which accelerates convergence, brings the network close to its optimal parameters, and enhances the network's extraction of different features;
3: on the basis of the features extracted by the first ten layers of VGG16, the custom scale perception module further extracts feature information at different spatial scales, improves the network's attention to dense crowds, and alleviates the problem that features extracted at a single scale are not comprehensive enough. Multi-level attention increases the weights of effective features at different scales, weakens the background weight, and improves the regression performance.
Drawings
FIG. 1 is a schematic flow chart structure diagram of a crowd counting method based on multi-level attention scale perception according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a multi-level attention-scale aware neural network architecture according to one embodiment of the present invention;
FIG. 3 is a schematic diagram of a channel attention upsampling module (UGA) of one embodiment of the present invention;
FIG. 4 is a schematic diagram of a spatial attention upsampling module (USA) of one embodiment of the present invention;
FIG. 5 is a schematic diagram of a pixel attention upsampling module (UPA) according to one embodiment of the present invention;
FIG. 6 is a schematic diagram of a scale-aware module architecture according to one embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the present invention provides a crowd counting method based on multi-level attention scale perception, comprising the following steps:
s1: acquiring a data set and preprocessing the data set to obtain a density map of a training set and a density map of a testing set;
s2: constructing a backbone of a multi-level attention scale perception neural network;
s3: debugging and training the multi-level attention scale perception neural network and testing the effectiveness of the network based on the training set, the testing set and the backbone of the multi-level attention scale perception neural network to obtain the trained neural network;
s4: and acquiring a camera image, inputting the trained neural network for testing, and acquiring a predicted density map and the predicted number of people of the image.
The invention adopts a multi-scale perception neural network that can effectively extract the features of crowds of different densities. Meanwhile, attention at different scales concentrates the network on the densely crowded regions of a single picture, which alleviates the problem that features extracted at a single scale are not rich enough and strengthens the ability of the multi-level feature maps to learn suitable feature representations.
Further, the step S1 includes:
s11: downloading a public data set, and dividing the public data set into a training set and a testing set;
s12: carrying out data enhancement on the training set and the test set by flipping each image left and right, doubling the data volume, to obtain the image data of the training set and of the test set respectively;
s13: padding the width and height of the training-set and test-set images to multiples of 16, and proportionally adjusting the head positions in the positioning maps, to obtain the label positioning maps of the training set and of the test set;
s14: processing the label positioning map of the training set into the density map of the training set using a Gaussian kernel function with kernel size 15, and likewise processing the label positioning map of the test set into the density map of the test set.
Further, the step S2 includes:
s21: designing the structure of the feature extraction encoder: the first ten convolution layers of VGG16 are used as the feature extraction layers, all with kernel = 3, using Conv2d convolutions, with a Relu activation function after each convolution layer; the channel configuration is 64, 64, maxpooling (kernel = 2), 128, 128, maxpooling (kernel = 2), 256, 256, 256, maxpooling (kernel = 2), 512, 512, 512. This encoder structure extracts the depth features, and the VGG16 pre-training parameters are loaded;
s22: designing the decoder that regresses the crowd density map and the number of people.
Further, the step S22 includes:
s221: the back-end backbone network is SA-UGA-SA-USA-SA-UPA: a scale perception module SA with kernel = 3, 512 input channels and 128 output channels, followed by a Relu activation function; a custom channel attention upsampling module UGA; a two-dimensional convolution with kernel = 3, 128 input channels and 64 output channels, followed by a Relu activation function; a custom spatial attention upsampling module USA; a two-dimensional convolution with kernel = 3, 64 input channels and 16 output channels, followed by a Relu activation function; a custom pixel attention upsampling module UPA; the last layer is a full convolution with 16 input channels, 1 output channel and kernel = 1, followed by a Relu activation function; a residual structure is added between the SA modules, and the network finally outputs the predicted density map;
s222: constructing a pixel attention module: a two-dimensional convolution with kernel = 1 and input channels equal to output channels is applied to the input image in, followed by a sigmoid function, giving out; the final output is the element-wise product of in and out, plus in;
s223: constructing custom multi-level attention modules: the channel attention upsampling module (UGA) performs twofold upsampling, takes the sum of adaptive average pooling and adaptive max pooling, passes the sum through a Sigmoid function, and combines the result with the twofold-upsampled feature map to give the output feature map of the module; the spatial attention upsampling module (USA) performs twofold upsampling, concatenates the channel-wise mean and maximum of the feature map along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3 followed by a Sigmoid function, and combines the result with the twofold-upsampled feature map to give the output feature map of the module; the pixel attention upsampling module (UPA) performs twofold upsampling followed by a convolution with kernel size 3 and a convolution with kernel size 1, weights the output of the kernel-1 convolution by its Sigmoid output, and applies a further kernel-1 and kernel-3 convolution to give the output feature map of the module.
S224: constructing the custom scale perception module SA. The input is x with c channels. Features are extracted by four parallel branches: the first branch is a convolution with kernel size 3 and dilation rate 1, giving f1; the second, third and fourth branches use the same kernel with different dilation rates, giving f2, f3 and f4. f1 is combined with f2, reduced to c channels by a two-dimensional convolution and passed through a pixel attention module to give y1; y1 is concatenated with f3, reduced to c channels and passed through a pixel attention module to give y2; y2 is concatenated with f4, reduced to c channels and passed through a pixel attention module to give y3. y1, y2 and y3 are concatenated along the channel dimension, and a final two-dimensional convolution reduces the channel count from 3c to c, giving the output y.
Further, the step S3 includes:
s31: setting the loss function and parameters: the loss function is the MSE (mean squared error), the Adam optimizer is used, the batch size is set to 1, the learning rate to 0.00001, and the number of epochs to 1000;
s32: inputting the preprocessed images and their Gaussian density maps into the neural network for training;
s33: loading the trained network parameters, evaluating the MAE and MSE metrics on the test set, and estimating the network performance.
Further, the step S4 includes:
s41: scaling the picture acquired by the camera to at most 768×1024 pixels to obtain a processed picture;
s42: inputting the processed picture into the trained neural network to obtain the corresponding predicted density map and the predicted number of people y.
Specifically, as shown in fig. 1, the present invention provides a crowd counting method based on multi-level attention scale perception, including:
s1: acquiring a data set and preprocessing the data set;
s2: constructing a multi-level attention scale perception neural network backbone;
s3: debugging and training a multi-level attention scale perception neural network and testing the effectiveness of the network;
s4: and acquiring a camera image, inputting the trained neural network for testing, and acquiring a predicted density map and the predicted number of people of the image.
As shown in fig. 2, the present invention provides a crowd counting method based on multi-level attention scale perception, further elaborating the details of the multi-level attention scale perception neural network structure, including:
1: the front-end network extracts features. The first ten convolution layers of VGG16 are used as the feature extraction layers, all with kernel = 3, using Conv2d convolutions, with a Relu activation function added after each convolution layer; the channel configuration is 64, 64, maxpooling (kernel = 2), 128, 128, maxpooling (kernel = 2), 256, 256, 256, maxpooling (kernel = 2), 512, 512, 512. The depth features are extracted with this structure.
2: back-end network design.
3: multi-level attention scale aware neural network
As shown in fig. 3 to 6, the present invention provides a crowd counting method based on multi-level attention scale perception, further elaborating the multi-level attention scale perception module therein, including:
1: a pixel attention module is constructed. A two-dimensional convolution with kernel = 1 and input channels equal to output channels is applied to the input image in, followed by a sigmoid function, giving out; the final output is the element-wise product of in and out, plus in. In this way a weight parameter is added to each pixel, improving precision.
2: custom multi-level attention modules are constructed. The channel attention upsampling module (UGA) performs twofold upsampling, takes the sum of adaptive average pooling and adaptive max pooling, passes the sum through a Sigmoid function, and combines the result with the twofold-upsampled feature map to give the output feature map of the module. The spatial attention upsampling module (USA) performs twofold upsampling, concatenates the channel-wise mean and maximum of the feature map along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3 followed by a Sigmoid function, and combines the result with the twofold-upsampled feature map to give the output feature map of the module. The pixel attention upsampling module (UPA) performs twofold upsampling followed by a convolution with kernel size 3 and a convolution with kernel size 1, weights the output of the kernel-1 convolution by its Sigmoid output, and applies a further kernel-1 and kernel-3 convolution to give the output feature map of the module.
3: the custom scale perception module SA is constructed. The input is x with c channels. Features are extracted by four parallel branches: the first branch is a convolution with kernel size 3 and dilation rate 1, giving f1; the second, third and fourth branches use the same kernel with different dilation rates, giving f2, f3 and f4. f1 is combined with f2, reduced to c channels by a two-dimensional convolution and passed through a pixel attention module to give y1; y1 is concatenated with f3, reduced to c channels and passed through a pixel attention module to give y2; y2 is concatenated with f4, reduced to c channels and passed through a pixel attention module to give y3. y1, y2 and y3 are concatenated along the channel dimension, and a final two-dimensional convolution reduces the channel count from 3c to c, giving the output y.
As shown in fig. 2, the present invention provides a crowd counting method based on multi-level attention scale perception, and the application part is further described here. Image data is acquired with a camera and scaled to at most 768×1024 pixels; a gray image is converted to an RGB three-channel image; the trained network and its parameters are loaded; and the image is input to obtain the predicted number of people.
The invention can be used in people-flow detection systems for large-scale gathering places, densely populated tourist sites, shopping malls and the like. It predicts the number of people in the current picture from a single image, and is particularly accurate in densely populated scenes.
Compared with the prior art, the invention has the beneficial effects that:
1: the invention can perform more accurate crowd density and count estimation for large-scale crowds;
2: the structure of a classical convolutional neural network is improved: simple convolutional layers are replaced by multi-level attention modules and a custom scale perception module, and an Adam optimizer is used to optimize the initial weights and thresholds of the neural network, which accelerates convergence, brings the network close to its optimal parameters, and enhances the network's extraction of different features;
3: on the basis of the features extracted by the first ten layers of VGG16, the custom scale perception module further extracts feature information at different spatial scales, improves the network's attention to dense crowds, and alleviates the problem that features extracted at a single scale are not comprehensive enough. Multi-level attention increases the weights of effective features at different scales, weakens the background weight, and improves the regression performance.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (6)
1. A crowd counting method based on multi-level attention scale perception is characterized by comprising the following steps:
S1: acquiring a data set and preprocessing it to obtain a training-set density map and a test-set density map;
S2: constructing the backbone of a multi-level attention scale perception neural network;
S3: debugging and training the multi-level attention scale perception neural network, and testing its effectiveness based on the training set, the test set and the backbone, to obtain a trained neural network;
S4: acquiring a camera image, inputting it into the trained neural network for testing, and obtaining the predicted density map and predicted head count for the image.
2. The method for people counting based on multi-level attention scale perception according to claim 1, wherein the step S1 comprises:
S11: downloading a public data set and dividing it into a training set and a test set;
S12: performing data enhancement on the training set and the test set by flipping the images horizontally, doubling the amount of data, to obtain training-set and test-set image data respectively;
S13: padding the width and height of the training-set and test-set images to multiples of 16 and proportionally adjusting the positions in the localization maps, to obtain the localization maps of the training-set labels and of the test-set labels;
S14: processing the localization map of the training-set labels into the training-set density map using a Gaussian kernel function with a Gaussian kernel size of 15, and likewise processing the localization map of the test-set labels into the test-set density map.
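The density-map construction of S14 can be sketched as follows. This is a minimal illustration, not the patented implementation: each annotated head position is splatted with a fixed, normalized 15×15 Gaussian kernel; the claim fixes only the kernel size, so the value sigma=4 here is an assumption.

```python
import numpy as np

def gaussian_kernel(size=15, sigma=4.0):
    """Normalized 2D Gaussian kernel (sums to 1, so each head adds 1 to the total count)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def density_map(shape, head_points, size=15, sigma=4.0):
    """Splat a normalized Gaussian at every annotated head location.

    shape: (H, W) of the padded image; head_points: iterable of (row, col).
    Kernels are clipped at the image border, so heads very close to the
    boundary contribute slightly less than 1 to the integrated count.
    """
    h, w = shape
    dm = np.zeros((h, w), dtype=np.float64)
    k = gaussian_kernel(size, sigma)
    r = size // 2
    for (y, x) in head_points:
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        x0, x1 = max(0, x - r), min(w, x + r + 1)
        ky0, kx0 = y0 - (y - r), x0 - (x - r)
        dm[y0:y1, x0:x1] += k[ky0:ky0 + (y1 - y0), kx0:kx0 + (x1 - x0)]
    return dm

dm = density_map((64, 64), [(20, 20), (40, 31), (10, 50)])
print(round(dm.sum(), 6))  # interior heads each integrate to 1, so the map sums to 3.0
```

Because the kernel is normalized, the integral of the density map equals the annotated head count, which is what lets the network regress counts from density maps.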
3. The method for people counting based on multi-level attention scale perception according to claim 1, wherein the step S2 comprises:
S21: designing the structure of the feature-extraction encoder: the first ten layers of VGG16 are used as feature extraction layers, with Conv2d convolutions of kernel=3 and a ReLU activation function after each convolution layer; the channel counts are 64, 64, 128, 128, maxpooling (kernel=2), 256, 256, maxpooling (kernel=2), 512, 512, 512; deep features are extracted with this encoder structure, and VGG16 pre-trained parameters are loaded;
S22: designing the decoder that regresses the crowd density map and the head count.
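The basic building block of the encoder in S21 is a 3×3 convolution followed by a ReLU activation, with 2×2 max-pooling between blocks. A minimal single-channel NumPy sketch of that block is shown below; it is an illustration of the operations named in the claim, not the batched multi-channel VGG16 implementation (which would normally come from a deep-learning framework):

```python
import numpy as np

def conv3x3_relu(img, kernel):
    """Single-channel 3x3 convolution with zero padding 1, then ReLU.

    Padding 1 preserves the spatial size, as in the VGG16 feature layers
    (kernel=3); max-pooling with kernel=2 then halves the resolution.
    """
    assert kernel.shape == (3, 3)
    h, w = img.shape
    padded = np.pad(img, 1)
    out = np.zeros_like(img, dtype=np.float64)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return np.maximum(out, 0.0)  # ReLU activation

def maxpool2(img):
    """2x2 max pooling with stride 2 (kernel=2), halving each dimension."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.random.default_rng(0).standard_normal((16, 16))
y = conv3x3_relu(x, np.ones((3, 3)) / 9.0)  # a smoothing kernel as a stand-in weight
print(y.shape, maxpool2(y).shape)  # spatial size kept by conv, halved by pooling
```

This also shows why S13 pads image dimensions to multiples of 16: each maxpooling stage halves the resolution, so the input must divide evenly through the pooling stages.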
4. The method according to claim 3, wherein the step S22 comprises:
S221: the back-end backbone network is SA-UGA-SA-UPA-SA-USA: a scale perception module SA with kernel=3, 512 input channels and 128 output channels, followed by a ReLU activation function; a custom channel-attention upsampling module UGA; a two-dimensional convolution with kernel=3, 128 input channels and 64 output channels, followed by a ReLU activation function; a custom spatial-attention upsampling module USA; a two-dimensional convolution with kernel=3, 64 input channels and 16 output channels, followed by a ReLU activation function; a custom pixel-attention upsampling module UPA; the last layer is a fully convolutional layer with 16 input channels, 1 output channel and kernel=1, followed by a ReLU activation function; a residual structure is added between the SA modules, and the network finally outputs the predicted density map;
S222: constructing a pixel attention module: a two-dimensional convolution with kernel=1 and the number of output channels equal to the number of input channels is applied to the input feature map in; the result is passed through a sigmoid function to obtain out; the module output is the element-wise product of in and out, plus in, which assigns a weight parameter to each pixel;
S223: constructing the custom multi-level attention modules: the channel-attention upsampling module (UGA) performs 2× upsampling; the sum of adaptive average pooling and adaptive max pooling is passed through a Sigmoid function and applied as a weight to the 2×-upsampled feature map, giving the module's output feature map. The spatial-attention upsampling module (USA) performs 2× upsampling; the channel-wise mean and maximum of the feature map are concatenated along the channel dimension, passed through a convolution with kernel size 7 and dilation rate 3 and then a Sigmoid function, and applied as a weight to the 2×-upsampled feature map, giving the module's output feature map. The pixel-attention upsampling module (UPA) performs 2× upsampling followed by a two-dimensional convolution with kernel=3; a two-dimensional convolution with kernel=1 followed by a Sigmoid function produces per-pixel weights, which are applied to the kernel-3 convolution output to give the module's output feature map.
S224: and constructing a custom dimension perception module SA. The input is x, and the number c of x channels is copied; inputting four parallel different modes to extract features, wherein the size of a first layer convolution kernel is 3, and the cavity convolution variance is 1, so that f1 is obtained; setting different hole convolutions in the second layer to obtain f 2; setting different hole convolutions on the third layer to obtain f 3; the fourth layer sets up a different hole convolution to yield f 4. Adding f1 with the f2 of the next layer through a pixel attention module, wherein the number of channels after two-dimensional convolution is c, obtaining y1 through the pixel attention module, connecting with the f3 of the next layer, the number of channels after two-dimensional convolution is c, obtaining y2 through the pixel attention module, connecting with the f4 of the next layer, the number of channels after two-dimensional convolution is c, obtaining y3, y1, y2 and y3 through the pixel attention module, connecting according to the channels, and then performing two-dimensional convolution to change the number of channels from 3c to c, and obtaining output y.
5. The method for people counting based on multi-level attention scale perception according to claim 1, wherein the step S3 comprises:
S31: setting the loss function and parameters: the loss function is the mean squared error (MSE); an Adam optimizer is used; the batch size is set to 1, the learning rate to 0.00001, and the number of epochs to 1000;
S32: inputting the processed images, with their Gaussian density maps as labels, into the neural network for training;
S33: loading the trained network parameters, testing on the test set with the evaluation metrics MAE and MSE, and estimating the network performance.
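The evaluation metrics of S33 can be sketched as below. A hedged note on conventions: the claim does not define them, but in the crowd-counting literature MAE and MSE are usually computed over per-image predicted and ground-truth counts (the count being the sum of a density map), and the reported "MSE" is commonly the root of the mean squared error; both quantities are shown:

```python
import math

def count_metrics(pred_counts, gt_counts):
    """MAE and root-mean-squared error over per-image crowd counts."""
    diffs = [p - g for p, g in zip(pred_counts, gt_counts)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    rmse = math.sqrt(mse)  # often reported as "MSE" in crowd-counting papers
    return mae, rmse

mae, rmse = count_metrics([100.0, 250.0, 31.0], [90.0, 260.0, 30.0])
print(mae, rmse)  # MAE = 7.0; "MSE" reported as sqrt(67)
```

MAE measures average counting accuracy, while the squared-error term penalizes large per-image failures more heavily, which is why both are reported together.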
6. The method for people counting based on multi-level attention scale perception according to claim 1, wherein the step S4 comprises:
S41: processing the picture acquired by the camera so that it is smaller than 768×1024 pixels, obtaining a processed picture;
S42: inputting the processed picture into the trained neural network to obtain the corresponding predicted density map and the predicted head count y.
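The inference preprocessing of S41 can be sketched as a simple size check. Two assumptions are made here that the claim leaves open: an oversized camera frame is downscaled by a uniform factor until both dimensions fit within 768×1024, and the predicted head count y is the integral (sum) of the predicted density map, consistent with the Gaussian-normalized density maps of S14:

```python
def fit_within(h, w, max_h=768, max_w=1024):
    """Uniform scale factor that brings (h, w) within max_h x max_w.

    Assumption: aspect ratio is preserved; images already small enough
    pass through unchanged.
    """
    scale = min(1.0, max_h / h, max_w / w)
    return int(h * scale), int(w * scale)

def predicted_count(density_map):
    """The predicted head count y is the sum of the predicted density map."""
    return sum(sum(row) for row in density_map)

print(fit_within(1536, 2048))  # (768, 1024): downscaled to fit
print(fit_within(480, 640))    # (480, 640): already small enough
```

In a full pipeline the resized frame would also be padded to multiples of 16, as in S13, before being fed to the network.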
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110605990.1A CN113283356B (en) | 2021-05-31 | 2021-05-31 | Multistage attention scale perception crowd counting method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113283356A true CN113283356A (en) | 2021-08-20 |
CN113283356B CN113283356B (en) | 2024-04-05 |
Family
ID=77282919
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110605990.1A Active CN113283356B (en) | 2021-05-31 | 2021-05-31 | Multistage attention scale perception crowd counting method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113283356B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188685A (en) * | 2019-05-30 | 2019-08-30 | 燕山大学 | A kind of object count method and system based on the multiple dimensioned cascade network of double attentions |
WO2020169043A1 (en) * | 2019-02-21 | 2020-08-27 | 苏州大学 | Dense crowd counting method, apparatus and device, and storage medium |
CN112132023A (en) * | 2020-09-22 | 2020-12-25 | 上海应用技术大学 | Crowd counting method based on multi-scale context enhanced network |
CN112597964A (en) * | 2020-12-30 | 2021-04-02 | 上海应用技术大学 | Method for counting layered multi-scale crowd |
CN112668532A (en) * | 2021-01-05 | 2021-04-16 | 重庆大学 | Crowd counting method based on multi-stage mixed attention network |
Non-Patent Citations (1)
Title |
---|
Ma Qian (马骞): "Research on a Crowd Density Estimation Algorithm Based on a Channel-Domain Attention Mechanism", Electronic Design Engineering (电子设计工程), no. 15 *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115880588A (en) * | 2021-09-13 | 2023-03-31 | 国家电网有限公司 | Two-stage unmanned aerial vehicle detection method combined with time domain |
CN114399728A (en) * | 2021-12-17 | 2022-04-26 | 燕山大学 | Method for counting crowds in foggy day scene |
CN114399728B (en) * | 2021-12-17 | 2023-12-05 | 燕山大学 | Foggy scene crowd counting method |
CN114511636A (en) * | 2022-04-20 | 2022-05-17 | 科大天工智能装备技术(天津)有限公司 | Fruit counting method and system based on double-filtering attention module |
CN115019211A (en) * | 2022-06-28 | 2022-09-06 | 北京理工大学 | Segmentation guide attention group counting method for aerial images of unmanned aerial vehicle |
CN117253184A (en) * | 2023-08-25 | 2023-12-19 | 燕山大学 | Foggy day image crowd counting method guided by foggy priori frequency domain attention characterization |
CN117253184B (en) * | 2023-08-25 | 2024-05-17 | 燕山大学 | Foggy day image crowd counting method guided by foggy priori frequency domain attention characterization |
Also Published As
Publication number | Publication date |
---|---|
CN113283356B (en) | 2024-04-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113283356A (en) | Multi-level attention scale perception crowd counting method | |
CN108256562B (en) | Salient target detection method and system based on weak supervision time-space cascade neural network | |
Sindagi et al. | Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting | |
CN110210551B (en) | Visual target tracking method based on adaptive subject sensitivity | |
CN110119703B (en) | Human body action recognition method fusing attention mechanism and spatio-temporal graph convolutional neural network in security scene | |
CN108764085B (en) | Crowd counting method based on generation of confrontation network | |
CN112396607B (en) | Deformable convolution fusion enhanced street view image semantic segmentation method | |
CN112597964B (en) | Method for counting layered multi-scale crowd | |
CN111723693B (en) | Crowd counting method based on small sample learning | |
CN108805002B (en) | Monitoring video abnormal event detection method based on deep learning and dynamic clustering | |
CN111738054B (en) | Behavior anomaly detection method based on space-time self-encoder network and space-time CNN | |
WO2019136591A1 (en) | Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network | |
CN110084201B (en) | Human body action recognition method based on convolutional neural network of specific target tracking in monitoring scene | |
CN107506792B (en) | Semi-supervised salient object detection method | |
CN110827265B (en) | Image anomaly detection method based on deep learning | |
Hu et al. | Parallel spatial-temporal convolutional neural networks for anomaly detection and location in crowded scenes | |
CN113936235A (en) | Video saliency target detection method based on quality evaluation | |
CN114663665A (en) | Gradient-based confrontation sample generation method and system | |
CN112215241B (en) | Image feature extraction device based on small sample learning | |
CN112132867B (en) | Remote sensing image change detection method and device | |
CN111753714B (en) | Multidirectional natural scene text detection method based on character segmentation | |
CN111428809B (en) | Crowd counting method based on spatial information fusion and convolutional neural network | |
CN111401209B (en) | Action recognition method based on deep learning | |
CN116403152A (en) | Crowd density estimation method based on spatial context learning network | |
CN114494999B (en) | Double-branch combined target intensive prediction method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||