CN113283356B - Multistage attention scale perception crowd counting method - Google Patents
- Publication number
- CN113283356B CN113283356B CN202110605990.1A CN202110605990A CN113283356B CN 113283356 B CN113283356 B CN 113283356B CN 202110605990 A CN202110605990 A CN 202110605990A CN 113283356 B CN113283356 B CN 113283356B
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
Abstract
The invention provides a multi-level attention scale-aware crowd counting method, belonging to the application of deep learning in computer vision. The method comprises the following specific steps: S1: acquiring a data set; S2: constructing a multi-level attention scale-aware neural network; S3: debugging, training and testing the multi-level attention scale-aware neural network; S4: acquiring a camera image, inputting it into the trained neural network, and obtaining the predicted density map and predicted number of people for the image. The method is applicable to crowd-count detection in large-scale scenes and effectively improves the accuracy of the detection results.
Description
Technical Field
The invention relates to a multi-level attention scale perception crowd counting method.
Background
With the acceleration of urbanization and the rapid development of the urban economy, crowd-gathering scenes and tourist numbers are increasing, accompanied by potential safety hazards. A crowd counting method that predicts the number of people and gives early warning of highly crowded scenes therefore allows the responsible personnel to issue warnings before an emergency and make decisions after one, safeguarding people's lives and property and preventing dangerous incidents.
Currently, existing crowd counting approaches fall mainly into two categories: 1) traditional methods, such as support vector machines and decision trees; 2) deep-learning methods, such as the multi-column and multi-channel networks MCNN and CSRNet. Both have certain limitations: the traditional methods of category 1) are complex and imprecise, while the existing neural networks of category 2) still suffer from comparatively low accuracy.
Disclosure of Invention
The invention aims to provide a multi-level attention scale perception crowd counting method.
In order to solve the above problems, the present invention provides a multi-level attention scale sensing crowd counting method, comprising:
s1: acquiring a data set and preprocessing to obtain a density map of a training set and a density map of a test set;
s2: constructing a backbone of a multi-level attention scale sensing neural network;
s3: debugging and training the multi-level attention scale-aware neural network, and testing the network effectiveness based on the training set, the test set and the backbone of the multi-level attention scale-aware neural network, to obtain a trained neural network;
s4: and acquiring a camera image, inputting a trained neural network for testing, and obtaining a predicted density map and the number of predicted persons of the image.
Further, the step S1 includes:
s11: downloading a public data set, and dividing the public data set into a training set and a testing set;
s12: performing data enhancement on the training set and the test set: each image is flipped horizontally, doubling the amount of data, to obtain the image data of the training set and of the test set respectively;
s13: padding the width and height of the training-set and test-set images up to multiples of 16 pixels, and adjusting the head positions in the positioning maps proportionally, to obtain the label positioning map of the training set and the label positioning map of the test set;
s14: processing the label positioning map of the training set into the density map of the training set with a Gaussian kernel function of kernel size 15, and likewise processing the label positioning map of the test set into the density map of the test set.
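As a sketch of s14, the positioning map (one impulse per annotated head) can be convolved with a normalized Gaussian kernel of size 15 so that the density map integrates to the head count. The choice sigma = 4 below is an assumption; the text fixes only the kernel size.

```python
import numpy as np

def gaussian_kernel(size=15, sigma=4.0):
    """Normalized 2-D Gaussian kernel (sums to 1, so each head adds 1 person)."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def density_map(shape, points, size=15, sigma=4.0):
    """Place one Gaussian blob per annotated head location (row, col)."""
    h, w = shape
    dm = np.zeros((h, w), dtype=np.float64)
    k = gaussian_kernel(size, sigma)
    r = size // 2
    for (y, x) in points:
        y, x = int(y), int(x)
        # clip the kernel at the image border so indexing stays in-frame
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        x0, x1 = max(0, x - r), min(w, x + r + 1)
        ky0, kx0 = y0 - (y - r), x0 - (x - r)
        dm[y0:y1, x0:x1] += k[ky0:ky0 + (y1 - y0), kx0:kx0 + (x1 - x0)]
    return dm
```

For heads well inside the frame, the integral of the map equals the number of annotated points, which is what makes density-map regression equivalent to counting.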
Further, the step S2 includes:
s21: designing the structure of the feature-extraction encoder: the first ten layers of VGG16 are taken as feature extraction layers (kernel = 3, Conv2d convolutions, each convolution layer followed by a ReLU activation), with channel widths 64, 64, 128, max-pooling (kernel = 2), 256, 256, 256, max-pooling (kernel = 2), 512, 512, 512; deep features are extracted with this encoder structure, and the VGG16 pre-trained parameters are loaded;
s22: and designing a regression population density map and the number of people of the decoder.
Further, the step S22 includes:
s221: the back-end backbone network is SA-UGA-SA-UPA-SA-USA: a scale-aware module SA with kernel 3, 512 input channels and 128 output channels, followed by a ReLU activation; a custom channel attention up-sampling module UGA; a two-dimensional convolution with kernel 3, 128 input channels and 64 output channels, followed by a ReLU activation; a custom spatial attention up-sampling module USA; a two-dimensional convolution with kernel 3, 64 input channels and 16 output channels, followed by a ReLU activation; a custom pixel attention up-sampling module UPA; the last layer is a fully convolutional layer with 16 input channels, 1 output channel and kernel 1, followed by a ReLU activation; in addition, residual connections are added between the SA modules, and the prediction density map is finally output;
s222: building a pixel attention module: the input feature map in is passed through a two-dimensional convolution whose input and output channel counts are equal and whose kernel is 1, followed by a sigmoid function, giving out; the final output is the point-wise product of in and out added back to in. In this way a weight parameter is learned for every pixel, which improves accuracy;
s223: constructing the custom multi-level attention modules: the channel attention up-sampling module (UGA) performs double up-sampling, takes the sum of adaptive average pooling and adaptive max pooling, and combines this, after a sigmoid function, with the doubly up-sampled feature map to give the module's output feature map; the spatial attention up-sampling module (USA) performs double up-sampling, concatenates the channel-wise average and maximum feature maps along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3, and combines this, after a sigmoid function, with the doubly up-sampled feature map to give the module's output feature map; the pixel attention up-sampling module (UPA) performs double up-sampling and a two-dimensional convolution with kernel 3, then a two-dimensional convolution with kernel 1 whose sigmoid output weights the kernel-1 convolution output, followed by a final two-dimensional convolution with kernel 3, giving the module's output feature map.
S224: constructing the custom scale-aware module SA: the input x, with c channels, is copied and fed into four parallel branches for feature extraction, each with convolution kernel size 3: the first branch uses dilation rate 1, giving f1; the second, third and fourth branches are set with different dilation rates, giving f2, f3 and f4. f1 is passed through a pixel attention module and added to f2; a two-dimensional convolution restores c channels and a pixel attention module yields y1. f3 is combined with f1 in the same way, convolved back to c channels and passed through a pixel attention module to yield y2, and f4 with f1 to yield y3. Finally, y1, y2 and y3 are concatenated along the channel dimension and a two-dimensional convolution reduces the 3c channels back to c, giving the output y.
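The pixel attention module of s222 reduces, in numpy terms, to a per-pixel gated residual. The sketch below is a minimal illustration under that reading: the 1 × 1 convolution is written as a channel-mixing matrix, with weights supplied by the caller rather than learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pixel_attention(x, weight, bias):
    """Pixel attention as described in s222: a 1x1 convolution with
    in_channels == out_channels followed by a sigmoid gives a per-pixel
    gate a; the module returns x * a + x, a residual re-weighting of
    every pixel.
    x: (C, H, W) feature map; weight: (C, C) 1x1-conv kernel; bias: (C,)."""
    a = sigmoid(np.einsum('oc,chw->ohw', weight, x) + bias[:, None, None])
    return x * a + x
```

With zero weights the gate is sigmoid(0) = 0.5 everywhere, so the module scales its input by 1.5; a trained gate instead emphasizes head pixels and suppresses background.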
Further, the step S3 includes:
s31: loss function and parameter settings: the loss function is the MSE (mean squared error); the Adam optimizer is used, the batch size is set to 1, the learning rate to 0.00001, and the number of epochs to 1000;
s32: inputting the preprocessed images and their Gaussian density maps into the neural network for training;
s33: loading the trained network parameters, measuring the evaluation functions MAE and MSE on the test set, and estimating the network performance from them.
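For s33, crowd counting benchmarks conventionally score per-image counts, the predicted count being the integral of the predicted density map. A minimal sketch of the MAE and MSE evaluation functions (reporting the root of the mean squared error as MSE, the usual convention in this literature) is:

```python
import numpy as np

def count_errors(pred_maps, gt_counts):
    """MAE and (root-)MSE over per-image counts: the predicted count is
    the sum of the predicted density map, as is standard in crowd counting."""
    pred = np.array([m.sum() for m in pred_maps], dtype=np.float64)
    gt = np.asarray(gt_counts, dtype=np.float64)
    mae = np.abs(pred - gt).mean()
    mse = np.sqrt(((pred - gt) ** 2).mean())  # reported as "MSE" in the text
    return mae, mse
```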
Further, the step S4 includes:
s41: scaling the picture acquired by the camera so that it is no larger than 768 × 1024 pixels, to obtain a processed picture;
s42: inputting the processed picture into the trained neural network to obtain the corresponding predicted density map and predicted number of people y.
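Combining the size constraints of s13 and s41, the input-shape handling can be sketched as follows. The choice to downscale uniformly before padding is an assumption; the text fixes only the 768 × 1024 cap and the multiple-of-16 requirement.

```python
def pad_to_16(n):
    """Smallest multiple of 16 that is >= n (width/height are padded, s13)."""
    return ((n + 15) // 16) * 16

def inference_size(h, w, max_h=768, max_w=1024):
    """Scale the camera frame down so it fits inside max_h x max_w (s41),
    then pad both sides up to multiples of 16 for the network."""
    scale = min(1.0, max_h / h, max_w / w)
    h, w = int(h * scale), int(w * scale)
    return pad_to_16(h), pad_to_16(w)
```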
Compared with the prior art, the invention proceeds as follows. S1: acquiring a data set; S2: constructing a multi-level attention scale-aware neural network; S3: debugging, training and testing the multi-level attention scale-aware neural network; S4: acquiring a camera image, inputting it into the trained neural network, and obtaining the predicted density map and predicted number of people for the image. The method is applicable to crowd-count detection in large-scale scenes and effectively improves the accuracy of the detection results.
Compared with the prior art, the invention has the beneficial effects that:
1: the invention can more accurately estimate the crowd density and the crowd quantity for a large-scale crowd;
2: the invention improves the structure of the classical convolutional neural network, replaces a simple convolutional network layer by the multi-level attention module and the custom scale perception module, optimizes the initial weight threshold of the neural network by using the Adam optimizer, accelerates the convergence rate of the network, is close to the optimal parameters of the network, and enhances the extraction of different characteristics by the network;
3: the invention further extracts the characteristic information of different spaces through the custom scale perception module on the basis of extracting the characteristics of the first ten layers of VGG16, improves the attention of the network to the dense crowd, and solves the problem that the single scale characteristic extraction is not comprehensive enough. The weight of the effective features under different scales is increased through multistage attention, the background weight is weakened, and the regression performance is improved.
Drawings
FIG. 1 is a schematic flow chart of a crowd counting method based on multi-level attention scale perception according to one embodiment of the invention;
FIG. 2 is a schematic diagram of a multi-level attention scale aware neural network structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a channel attention upsampling module (UGA) of one embodiment of the present invention;
FIG. 4 is a schematic diagram of a spatial attention upsampling module (USA) of one embodiment of the present invention;
FIG. 5 is a schematic diagram of a pixel attention upsampling module (UPA) of one embodiment of the present invention;
FIG. 6 is a schematic diagram of a scale-aware module structure according to an embodiment of this invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
As shown in fig. 1, the present invention provides a multi-level attention scale sensing crowd counting method, comprising:
s1: acquiring a data set and preprocessing to obtain a density map of a training set and a density map of a test set;
s2: constructing a backbone of a multi-level attention scale sensing neural network;
s3: debugging and training the multi-level attention scale-aware neural network, and testing the network effectiveness based on the training set, the test set and the backbone of the multi-level attention scale-aware neural network, to obtain a trained neural network;
s4: and acquiring a camera image, inputting a trained neural network for testing, and obtaining a predicted density map and the number of predicted persons of the image.
The invention adopts a multi-scale-aware neural network that effectively extracts the features of crowds of different densities. Attention applied at different scales concentrates the network on the densely crowded regions of a single picture, remedying the poverty of single-scale feature extraction and helping the feature maps at multiple levels learn appropriate feature representations.
Specifically, as shown in fig. 1, the present invention provides a crowd counting method based on multi-level attention scale sensing, including:
s1: acquiring a data set and preprocessing;
s2: constructing a multi-level attention scale sensing neural network backbone;
s3: debugging and training a multi-level attention scale aware neural network and testing network effectiveness;
s4: and acquiring a camera image, inputting a trained neural network for testing, and obtaining a predicted density map and the number of predicted persons of the image.
As shown in fig. 2, the present invention provides a crowd counting method based on multi-level attention scale sensing, further describing structural details of a multi-level attention scale sensing neural network, including:
1: the front-end network extracts features. The first ten layers of VGG16 were taken as feature extraction layers, kernel=3, conv2d convolution was used, and each convolution layer was followed by a Relu activation function with layers 64, 64, 128, maxpooling (kernel=2), 256, 256, 256, maxpooling (kernel=2), 512, 512, 512. Depth feature is extracted with this structure.
2: and (5) back-end network design.
3: multistage attention scale sensing neural network
As shown in fig. 3 to 6, the present invention provides a crowd counting method based on multi-level attention scale sensing, further describing a multi-level attention scale sensing module thereof, including:
1: a pixel attention module is constructed. And carrying out two-dimensional convolution on the input image in, wherein an input channel is equal to an output channel, the kernel is 1, then, a sigmoid function is connected to process to obtain out, and finally, the point multiplication of the input image in and the point multiplication of the output image out are added in. In this way, a weight parameter is added to each pixel point, so that the precision is improved.
2: a custom multi-level attention module is constructed. The channel attention upsampling module (UGA) process is double upsampling, the sum of the adaptive average pooling and the adaptive maximum pooling is taken, and the output characteristic diagram of the module is obtained by adding the characteristic diagram after the previous double upsampling to the Sigmoid function. The spatial attention upsampling module (USA) processes are double upsampling, the average value and the maximum value of the channel layer characteristic diagrams are connected according to the channel, convolution with the convolution kernel size of 7 and the expansion rate of 3 is carried out, the characteristic diagrams after the previous double upsampling are added through a Sigmoid function, and the output characteristic diagram of the module is obtained. The pixel attention up-sampling module (UPA) process is two-dimensional convolution with double up-sampling and convolution kernel of 3, two-dimensional convolution with convolution kernel of one, the output of the two-dimensional convolution with the weight after the output of the Sigmoid function and the convolution kernel of one, and the convolution with the two-dimensional convolution with the convolution kernel of one and the convolution kernel of 3 are adopted, so that the output characteristic diagram of the module is obtained.
3: and constructing a custom scale perception module SA. The input is x, the number of x channels c is replicated. Inputting four parallel different modes to extract features, wherein the size of a first layer convolution kernel is 3, and the cavity convolution condition is 1 to obtain f1; setting different cavity convolutions on the second layer to obtain f2; setting different cavity convolutions on a third layer to obtain f3; and setting different cavity convolutions on the fourth layer to obtain f4. And f1 is added with the next layer f2 through a pixel attention module, the number of channels after two-dimensional convolution is c, y1 is obtained through the pixel attention module, then the next layer f3 is connected with the two-dimensional convolution, y2 is obtained through the pixel attention module, then the next layer f4 is connected with the two-dimensional convolution, the number of channels after two-dimensional convolution is c, and y3 is obtained through the pixel attention module. And (3) carrying out two-dimensional convolution on the y1, y2 and y3 after the channel connection to change the channel number from 3c to c, so as to obtain an output y.
As shown in fig. 2, the application of the crowd counting method based on multi-level attention scale sensing is further described. Image data are acquired with a camera and scaled to at most 768 × 1024 pixels; if an image is grayscale, it is converted to a three-channel RGB image; the trained network and its parameters are then loaded, and the picture is input to obtain the predicted number of people.
The invention can be used for people flow detection systems of large-scale gatherings, tourist sites, markets and the like with dense crowd, and can be used for predicting the number of people in the current picture by utilizing a single picture, and particularly, the invention is more accurate under the condition of dense number of people.
In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the identical or similar parts reference may be made between them.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (3)
1. A crowd counting method based on multi-level attention scale perception, comprising:
s1: acquiring a data set and preprocessing to obtain a density map of a training set and a density map of a test set;
s2: constructing a backbone of a multi-level attention scale sensing neural network;
s3: based on the training set, the test set and the backbone of the multi-level attention scale-aware neural network, debugging and training the network and testing its effectiveness, to obtain a trained neural network;
s4: acquiring a camera image and inputting it into the trained neural network for testing, obtaining the predicted density map and the predicted number of people for the image;
the step S1 includes:
s11: downloading a public data set, and dividing the public data set into a training set and a testing set;
s12: performing data enhancement on the training set and the test set by flipping each image left and right, doubling the amount of data, so that the image data of the training set and of the test set are obtained respectively;
s13: padding the width and height of the training-set and test-set images to multiples of 16, and adjusting the positions in the localization maps proportionally, to obtain the localization map of the training-set labels and the localization map of the test-set labels;
s14: processing the localization map of the training-set labels into the density map of the training set using a Gaussian kernel function with a Gaussian kernel size of 15, and likewise processing the localization map of the test-set labels into the density map of the test set;
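Step S14 follows the usual density-map convention: each head annotation becomes an impulse blurred by a Gaussian, so the map's total sum equals the head count. A minimal sketch, assuming sigma = 4 (the claim fixes only the kernel size, 15) and a hypothetical helper name:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_density_map(head_points, height, width, sigma=4.0):
    """Turn (x, y) head annotations into a density map whose sum is the count."""
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in head_points:               # annotations as (column, row)
        if 0 <= int(y) < height and 0 <= int(x) < width:
            density[int(y), int(x)] += 1.0
    # truncate=1.75 with sigma=4 gives kernel radius int(1.75*4 + 0.5) = 7,
    # i.e. a 15x15 kernel as in the claim; blurring preserves the total count
    return gaussian_filter(density, sigma=sigma, truncate=1.75)
```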
the step S2 includes:
s21: designing the structure of the feature-extraction encoder: the first ten convolutional layers of VGG16 serve as the feature extraction layers, with kernel = 3, Conv2d convolutions, and a ReLU activation after each convolutional layer; the channel configuration is 64, 64, max-pooling (kernel = 2), 128, 128, max-pooling (kernel = 2), 256, 256, 256, max-pooling (kernel = 2), 512, 512, 512; this encoder structure extracts deep features, and the VGG16 pre-training parameters are loaded;
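The encoder of S21 can be sketched in PyTorch as below. The channel configuration follows the standard VGG16 front end with ten 3×3 convolutions (as popularized by CSRNet-style counters); loading the ImageNet pre-trained weights is omitted here.

```python
import torch
import torch.nn as nn

def make_encoder():
    # "M" marks a 2x2 max-pool; numbers are output channels of 3x3 convolutions
    cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M", 512, 512, 512]
    layers, in_ch = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]   # ReLU after every convolution
            in_ch = v
    return nn.Sequential(*layers)
```

With three pooling stages, a 768 × 1024 input yields a 512-channel feature map at 1/8 resolution.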
s22: designing the decoder that regresses the crowd density map and the number of people;
the step S22 includes:
s221: the back-end backbone network is SA-UGA-SA-UPA-SA-USA: a scale-aware module SA with kernel 3, 512 input channels and 128 output channels, followed by a ReLU activation; a custom channel attention up-sampling module UGA; a two-dimensional convolution with kernel 3, 128 input channels and 64 output channels, followed by a ReLU activation; a custom pixel attention up-sampling module UPA; a two-dimensional convolution with kernel 3, 64 input channels and 16 output channels, followed by a ReLU activation; a custom spatial attention up-sampling module USA; the last layer is a full convolution with 16 input channels, 1 output channel and kernel 1, followed by a ReLU activation; in addition, residual structures are added between the SA modules, and finally the predicted density map is output;
s222: building the pixel attention module: the input feature map in undergoes a two-dimensional convolution whose input channels equal its output channels, with kernel 1, followed by a sigmoid function, yielding out; the final output is the element-wise product of in and out added to in, which assigns a weight parameter to every pixel;
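A minimal PyTorch sketch of this pixel attention block, under the reading that the output is in · out + in:

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """1x1 convolution + sigmoid produces a per-pixel weight map."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out = self.sigmoid(self.conv(x))   # per-pixel weights in (0, 1)
        return x * out + x                 # re-weighted features plus identity
```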
s223: constructing the custom multi-level attention modules: the channel attention up-sampling module (UGA) performs double up-sampling, takes the sum of adaptive average pooling and adaptive max pooling, and combines the result, via a sigmoid function, with the feature map after the preceding double up-sampling to obtain the module's output feature map; the spatial attention up-sampling module (USA) performs double up-sampling, concatenates the channel-wise average and maximum of the feature map along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3, and combines the result, via a sigmoid function, with the feature map after the preceding double up-sampling to obtain the module's output feature map; the pixel attention up-sampling module (UPA) performs double up-sampling followed by a two-dimensional convolution with kernel 3 and a two-dimensional convolution with kernel 1; the weights output by the sigmoid function are added to the output of the kernel-1 two-dimensional convolution, after which a two-dimensional convolution with kernel 1 and a convolution with kernel 3 yield the module's output feature map;
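As an illustration of the first of these modules, a hedged PyTorch sketch of the channel attention up-sampling module (UGA). The claim does not fix how the sigmoid output is combined with the up-sampled features, so an element-wise multiplication (the usual attention convention) is assumed, and any gating convolutions are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UGA(nn.Module):
    """Channel attention up-sampling: 2x upsample, then gate channels by the
    sigmoid of (adaptive average pool + adaptive max pool)."""
    def forward(self, x):
        up = F.interpolate(x, scale_factor=2, mode="nearest")   # double upsample
        desc = F.adaptive_avg_pool2d(up, 1) + F.adaptive_max_pool2d(up, 1)
        return up * torch.sigmoid(desc)    # broadcast channel-wise re-weighting
```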
s224: constructing the custom scale-aware module SA: the input is x, with channel number c; features are extracted by four parallel branches: the first branch is a convolution with kernel size 3 and dilation rate 1, yielding f1; the second, third and fourth branches use different dilated convolutions, yielding f2, f3 and f4 respectively; f1 is combined with f2, reduced to c channels by a two-dimensional convolution and passed through the pixel attention module to obtain y1; f1 is likewise combined with f3, reduced to c channels and passed through the pixel attention module to obtain y2; f1 is combined with f4, reduced to c channels and passed through the pixel attention module to obtain y3; y1, y2 and y3 are then concatenated along the channel dimension, and a two-dimensional convolution reduces the channel number from 3c to c, giving the output y.
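A sketch of the scale-aware module SA. The dilation rates 2, 3 and 4 for the second to fourth branches are assumptions (the claim fixes only dilation 1 for the first branch), and plain 1×1 convolutions stand in for the claimed pixel-attention fusion:

```python
import torch
import torch.nn as nn

class ScaleAware(nn.Module):
    def __init__(self, c, dilations=(1, 2, 3, 4)):   # dilations 2-4 assumed
        super().__init__()
        # Four parallel 3x3 branches; padding = dilation keeps spatial size
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, kernel_size=3, padding=d, dilation=d)
            for d in dilations)
        # Pairwise fusion 2c -> c (stand-in for the pixel-attention fusion)
        self.fuse_pair = nn.ModuleList(
            nn.Conv2d(2 * c, c, kernel_size=1) for _ in range(3))
        self.fuse_out = nn.Conv2d(3 * c, c, kernel_size=1)   # 3c -> c

    def forward(self, x):
        f1, f2, f3, f4 = (b(x) for b in self.branches)
        y1 = self.fuse_pair[0](torch.cat([f1, f2], dim=1))
        y2 = self.fuse_pair[1](torch.cat([f1, f3], dim=1))
        y3 = self.fuse_pair[2](torch.cat([f1, f4], dim=1))
        return self.fuse_out(torch.cat([y1, y2, y3], dim=1))
```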
2. The method of claim 1, wherein the step S3 comprises:
s31: loss function and parameter settings: the loss function is the mean squared error (MSE); the Adam optimizer is used; the batch size is set to 1, the learning rate to 0.00001, and the number of epochs to 1000;
s32: inputting the Gaussian-processed images into the neural network for training;
s33: loading the trained network parameters, measuring the evaluation metrics MAE and MSE on the test set, and estimating the network performance.
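The MAE and MSE of S33 are, by crowd-counting convention, computed on per-image total counts rather than per-pixel errors; note that the reported "MSE" is conventionally the root of the mean squared count error. A minimal NumPy version with a hypothetical helper name:

```python
import numpy as np

def counting_metrics(pred_counts, true_counts):
    """Per-image count errors: MAE and (root) MSE as reported in counting work."""
    diff = np.asarray(pred_counts, float) - np.asarray(true_counts, float)
    mae = np.mean(np.abs(diff))
    mse = np.sqrt(np.mean(diff ** 2))   # RMSE, commonly labelled "MSE"
    return mae, mse
```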
3. The method of claim 1, wherein the step S4 comprises:
s41: processing the picture acquired by the camera to no more than 768 × 1024 pixels to obtain a processed picture;
s42: inputting the processed picture into the trained neural network to obtain the corresponding predicted density map and the predicted number of people y.
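In S42 the predicted number of people is read off the density map by summation, the standard convention for density-map regression (each head contributes total mass 1 to the map):

```python
import numpy as np

def count_from_density(density_map):
    # The predicted head count is the integral (sum) of the density map
    return float(np.sum(density_map))
```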
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110605990.1A CN113283356B (en) | 2021-05-31 | 2021-05-31 | Multistage attention scale perception crowd counting method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113283356A CN113283356A (en) | 2021-08-20 |
CN113283356B true CN113283356B (en) | 2024-04-05 |
Family
ID=77282919
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115880588A (en) * | 2021-09-13 | 2023-03-31 | 国家电网有限公司 | Two-stage unmanned aerial vehicle detection method combined with time domain |
CN114399728B (en) * | 2021-12-17 | 2023-12-05 | 燕山大学 | Foggy scene crowd counting method |
CN114511636B (en) * | 2022-04-20 | 2022-07-12 | 科大天工智能装备技术(天津)有限公司 | Fruit counting method and system based on double-filtering attention module |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188685A (en) * | 2019-05-30 | 2019-08-30 | 燕山大学 | A kind of object count method and system based on the multiple dimensioned cascade network of double attentions |
WO2020169043A1 (en) * | 2019-02-21 | 2020-08-27 | 苏州大学 | Dense crowd counting method, apparatus and device, and storage medium |
CN112132023A (en) * | 2020-09-22 | 2020-12-25 | 上海应用技术大学 | Crowd counting method based on multi-scale context enhanced network |
CN112597964A (en) * | 2020-12-30 | 2021-04-02 | 上海应用技术大学 | Method for counting layered multi-scale crowd |
CN112668532A (en) * | 2021-01-05 | 2021-04-16 | 重庆大学 | Crowd counting method based on multi-stage mixed attention network |
Non-Patent Citations (1)
Title |
---|
Research on a crowd density estimation algorithm based on a channel-domain attention mechanism; Ma Qian; Electronic Design Engineering (15); full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113283356B (en) | Multistage attention scale perception crowd counting method | |
CN108256562B (en) | Salient target detection method and system based on weak supervision time-space cascade neural network | |
CN112396607B (en) | Deformable convolution fusion enhanced street view image semantic segmentation method | |
CN111738054B (en) | Behavior anomaly detection method based on space-time self-encoder network and space-time CNN | |
CN112861690A (en) | Multi-method fused remote sensing image change detection method and system | |
CN110826428A (en) | Ship detection method in high-speed SAR image | |
CN112597964B (en) | Method for counting layered multi-scale crowd | |
CN110827265B (en) | Image anomaly detection method based on deep learning | |
CN110020658B (en) | Salient object detection method based on multitask deep learning | |
CN111062381B (en) | License plate position detection method based on deep learning | |
CN113888547A (en) | Non-supervision domain self-adaptive remote sensing road semantic segmentation method based on GAN network | |
Cho et al. | Semantic segmentation with low light images by modified CycleGAN-based image enhancement | |
CN116152591B (en) | Model training method, infrared small target detection method and device and electronic equipment | |
CN115035371A (en) | Borehole wall crack identification method based on multi-scale feature fusion neural network | |
CN114663665A (en) | Gradient-based confrontation sample generation method and system | |
CN112132867B (en) | Remote sensing image change detection method and device | |
CN111753714B (en) | Multidirectional natural scene text detection method based on character segmentation | |
CN111626197B (en) | Recognition method based on human behavior recognition network model | |
CN111401209B (en) | Action recognition method based on deep learning | |
CN116403152A (en) | Crowd density estimation method based on spatial context learning network | |
CN115953736A (en) | Crowd density estimation method based on video monitoring and deep neural network | |
CN112215241B (en) | Image feature extraction device based on small sample learning | |
CN115578624A (en) | Agricultural disease and pest model construction method, detection method and device | |
CN113205078B (en) | Crowd counting method based on multi-branch progressive attention-strengthening | |
CN115346115A (en) | Image target detection method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||