CN113283356A - Multi-level attention scale perception crowd counting method - Google Patents


Info

Publication number
CN113283356A
Authority
CN
China
Prior art keywords
convolution
module
kernel
attention
network
Prior art date
Legal status
Granted
Application number
CN202110605990.1A
Other languages
Chinese (zh)
Other versions
CN113283356B (en)
Inventor
祝鲁宁 (Zhu Luning)
黄良军 (Huang Liangjun)
沈世晖 (Shen Shihui)
张亚妮 (Zhang Yani)
Current Assignee
Shanghai Institute of Technology
Original Assignee
Shanghai Institute of Technology
Priority date
Filing date
Publication date
Application filed by Shanghai Institute of Technology
Priority to CN202110605990.1A
Publication of CN113283356A
Application granted
Publication of CN113283356B
Active legal status
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/50 - Context or environment of the image
    • G06V 20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53 - Recognition of crowd images, e.g. recognition of crowd congestion


Abstract

The invention provides a multi-level attention scale perception crowd counting method, an application of deep learning in computer vision. The method comprises the following steps: S1: acquiring a data set; S2: constructing a multi-level attention scale perception neural network; S3: debugging, training and testing the multi-level attention scale perception neural network; S4: acquiring a camera image, feeding it to the trained neural network, and obtaining the predicted density map and predicted head count for the image. In this way, the method is applicable to counting people in large-scale scenes and effectively improves the accuracy of the detection result.

Description

Multi-level attention scale perception crowd counting method
Technical Field
The invention relates to a multi-level attention scale perception crowd counting method.
Background
With the accelerating pace of urbanization and the rapid development of urban economies, mass-gathering scenes and tourist numbers are increasing, and with them come safety hazards. A crowd counting method that predicts crowd size and raises early warnings for highly crowded scenes can therefore help the relevant personnel issue warnings and make decisions before and after an emergency, protect people's lives and property, and prevent dangerous events.
Existing crowd counting methods fall mainly into two categories: 1) traditional methods, such as support vector machines and decision trees; 2) deep-learning methods, such as convolutional neural networks like MCNN and CSRNet. Both have limitations: the traditional methods of category 1) are complex and imprecise, and the existing neural networks of category 2) still suffer from low accuracy.
Disclosure of Invention
The invention aims to provide a multi-level attention scale perception crowd counting method.
In order to solve the above problems, the present invention provides a multi-level attention scale perception crowd counting method, comprising:
s1: acquiring a data set and preprocessing the data set to obtain a density map of a training set and a density map of a testing set;
s2: constructing a backbone of a multi-level attention scale perception neural network;
s3: debugging and training the multi-level attention scale perception neural network and testing the effectiveness of the network based on the training set, the testing set and the backbone of the multi-level attention scale perception neural network to obtain the trained neural network;
s4: and acquiring a camera image, inputting the trained neural network for testing, and acquiring a predicted density map and the predicted number of people of the image.
Further, the step S1 includes:
s11: downloading a public data set, and dividing the public data set into a training set and a testing set;
s12: performing data enhancement on the training set and the test set by flipping each image left and right, doubling the data volume, to obtain the image data of the training set and of the test set respectively;
s13: padding the width and height of the training-set and test-set images to multiples of 16 and adjusting the head-location map coordinates proportionally, to obtain the location map of the training-set labels and the location map of the test-set labels;
s14: processing the location map of the training-set labels into the density map of the training set using a Gaussian kernel function with a Gaussian kernel size of 15, and likewise processing the location map of the test-set labels into the density map of the test set.
Further, the step S2 includes:
s21: designing the feature-extraction encoder structure: taking the first ten convolution layers of VGG16 as feature extraction layers, each a Conv2d convolution with kernel=3 followed by a ReLU activation function, with channel counts 64, 64, 128, 128, maxpooling(kernel=2), 256, 256, 256, maxpooling(kernel=2), 512, 512, 512; extracting depth features with this encoder structure and loading the VGG16 pre-training parameters;
s22: designing the decoder that regresses the crowd density map and the head count.
Further, the step S22 includes:
s221: the back-end backbone network is SA-UGA-SA-UPA-SA-USA: a scale-aware module SA with kernel=3, 512 input channels and 128 output channels, followed by a ReLU activation function; a custom channel attention upsampling module UGA; a two-dimensional convolution with kernel=3, 128 input channels and 64 output channels, followed by a ReLU activation function; a custom spatial attention upsampling module USA; a two-dimensional convolution with kernel=3, 64 input channels and 16 output channels, followed by a ReLU activation function; a custom pixel attention upsampling module UPA; the last layer is a full convolution with 16 input channels, 1 output channel and kernel=1, followed by a ReLU activation function. A residual structure is added between the SA modules, and the network finally outputs the predicted density map;
s222: constructing a pixel attention module: apply to the input feature map in a two-dimensional convolution whose input channels equal its output channels, with kernel=1, then a sigmoid function, obtaining out; the final output is the element-wise product of in and out, plus in;
s223: constructing the custom multi-level attention modules: the channel attention upsampling module (UGA) first upsamples by a factor of two, takes the sum of adaptive average pooling and adaptive max pooling, passes it through a Sigmoid function, and combines the result with the twice-upsampled feature map to produce the module's output feature map; the spatial attention upsampling module (USA) first upsamples by a factor of two, concatenates the channel-wise mean and maximum of the feature map along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3 followed by a Sigmoid function, and combines the result with the twice-upsampled feature map to produce the module's output feature map; the pixel attention upsampling module (UPA) first upsamples by a factor of two, applies a two-dimensional convolution with kernel 3 and a two-dimensional convolution with kernel 1, passes the kernel-1 convolution output through a Sigmoid function to obtain a weight, and applies kernel-1 and kernel-3 convolutions to the weighted features to produce the module's output feature map.
S224: constructing the custom scale-aware module SA. The input is x, whose channel count is recorded as c. x is fed into four parallel branches that extract features in different ways: the first branch uses a convolution kernel of size 3 with a dilation rate of 1, giving f1; the second, third and fourth branches use different dilation rates, giving f2, f3 and f4. f1 is combined with f2 of the next branch, a two-dimensional convolution reduces the channels to c, and a pixel attention module yields y1; y1 is combined with f3, a two-dimensional convolution reduces the channels to c, and the pixel attention module yields y2; y2 is combined with f4, a two-dimensional convolution reduces the channels to c, and the pixel attention module yields y3. Finally, y1, y2 and y3 are concatenated along the channel dimension and a two-dimensional convolution reduces the channels from 3c to c, giving the output y.
Further, the step S3 includes:
s31: setting the loss function and parameters: the loss function is the MSE (mean squared error); the Adam optimizer is used, the batch size is set to 1, the learning rate to 0.00001, and the number of epochs to 1000;
s32: inputting the images and their Gaussian-processed density maps into the neural network for training;
s33: loading the trained network parameters, computing the evaluation metrics MAE and MSE on the test set, and estimating the network performance.
Further, the step S4 includes:
s41: processing the picture acquired by the camera to at most 768 × 1024 pixels, obtaining a processed picture;
s42: inputting the processed picture into the trained neural network to obtain the corresponding predicted density map and predicted head count y.
Compared with the prior art, the method proceeds through S1: acquiring a data set; S2: constructing a multi-level attention scale perception neural network; S3: debugging, training and testing the network; S4: acquiring a camera image, feeding it to the trained neural network, and obtaining the predicted density map and predicted head count of the image. In this way, the method is applicable to counting people in large-scale scenes and effectively improves the accuracy of the detection result.
Compared with the prior art, the invention has the beneficial effects that:
1: The invention can perform more accurate crowd density and count estimation for large-scale crowds;
2: The structure of a classical convolutional neural network is improved: simple convolutional layers are replaced with multi-level attention modules and a custom scale-aware module, and an Adam optimizer optimizes the network's initial weights and thresholds, accelerating convergence, bringing the network close to its optimal parameters, and strengthening the network's extraction of different features;
3: On top of the features extracted by the first ten layers of VGG16, the custom scale-aware module further extracts feature information at different spatial scales, raising the network's attention to dense crowds and overcoming the incompleteness of single-scale feature extraction. Multi-level attention increases the weights of effective features at different scales, weakens the background weight, and improves the regression performance.
Drawings
FIG. 1 is a schematic flow chart structure diagram of a crowd counting method based on multi-level attention scale perception according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a multi-level attention-scale aware neural network architecture according to one embodiment of the present invention;
FIG. 3 is a schematic diagram of a channel attention upsampling module (UGA) of one embodiment of the present invention;
FIG. 4 is a schematic diagram of a spatial attention upsampling module (USA) of one embodiment of the present invention;
FIG. 5 is a schematic diagram of a pixel attention upsampling module (UPA) according to one embodiment of the present invention;
FIG. 6 is a schematic diagram of a scale-aware module architecture according to one embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the present invention provides a multi-level attention scale perception population counting method, comprising:
a crowd counting method based on multi-level attention scale perception is characterized by comprising the following steps:
s1: acquiring a data set and preprocessing the data set to obtain a density map of a training set and a density map of a testing set;
s2: constructing a backbone of a multi-level attention scale perception neural network;
s3: debugging and training the multi-level attention scale perception neural network and testing the effectiveness of the network based on the training set, the testing set and the backbone of the multi-level attention scale perception neural network to obtain the trained neural network;
s4: and acquiring a camera image, inputting the trained neural network for testing, and acquiring a predicted density map and the predicted number of people of the image.
The invention adopts a multi-scale-aware neural network that can effectively extract features of crowds of different densities; meanwhile, attention at different scales concentrates the network on the densely crowded regions of a single picture, overcoming the limited richness of features extracted at a single scale and enhancing the ability of multi-level feature maps to learn suitable feature representations.
Further, the step S1 includes:
s11: downloading a public data set, and dividing the public data set into a training set and a testing set;
s12: performing data enhancement on the training set and the test set by flipping each image left and right, doubling the data volume, to obtain the image data of the training set and of the test set respectively;
s13: padding the width and height of the training-set and test-set images to multiples of 16 and adjusting the head-location map coordinates proportionally, to obtain the location map of the training-set labels and the location map of the test-set labels;
s14: processing the location map of the training-set labels into the density map of the training set using a Gaussian kernel function with a Gaussian kernel size of 15, and likewise processing the location map of the test-set labels into the density map of the test set.
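The density-map generation of s14 can be sketched as follows. This is a minimal illustration, not the patented implementation: the patent fixes the Gaussian kernel size at 15 but does not state the standard deviation, so the sigma below (and the `density_map` helper name) are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(points, height, width, sigma=4.0):
    """Convert head annotations (x, y) into a density map by smoothing a
    dot map with a fixed Gaussian. With sigma=4 and truncate=1.75 the
    filter radius is 7, i.e. an effective 15x15 kernel as in the patent."""
    dots = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        if 0 <= int(y) < height and 0 <= int(x) < width:
            dots[int(y), int(x)] += 1.0  # one unit of mass per head
    # Smoothing preserves the total mass, so the map still sums to the count.
    return gaussian_filter(dots, sigma=sigma, truncate=1.75)
```

Because the smoothing preserves total mass, integrating the density map recovers the annotated head count, which is what the network is trained to regress.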
Further, the step S2 includes:
s21: designing the feature-extraction encoder structure: taking the first ten convolution layers of VGG16 as feature extraction layers, each a Conv2d convolution with kernel=3 followed by a ReLU activation function, with channel counts 64, 64, 128, 128, maxpooling(kernel=2), 256, 256, 256, maxpooling(kernel=2), 512, 512, 512; extracting depth features with this encoder structure and loading the VGG16 pre-training parameters;
s22: designing the decoder that regresses the crowd density map and the head count.
Further, the step S22 includes:
s221: the back-end backbone network is SA-UGA-SA-UPA-SA-USA: a scale-aware module SA with kernel=3, 512 input channels and 128 output channels, followed by a ReLU activation function; a custom channel attention upsampling module UGA; a two-dimensional convolution with kernel=3, 128 input channels and 64 output channels, followed by a ReLU activation function; a custom spatial attention upsampling module USA; a two-dimensional convolution with kernel=3, 64 input channels and 16 output channels, followed by a ReLU activation function; a custom pixel attention upsampling module UPA; the last layer is a full convolution with 16 input channels, 1 output channel and kernel=1, followed by a ReLU activation function. A residual structure is added between the SA modules, and the network finally outputs the predicted density map;
s222: constructing a pixel attention module: apply to the input feature map in a two-dimensional convolution whose input channels equal its output channels, with kernel=1, then a sigmoid function, obtaining out; the final output is the element-wise product of in and out, plus in;
s223: constructing the custom multi-level attention modules: the channel attention upsampling module (UGA) first upsamples by a factor of two, takes the sum of adaptive average pooling and adaptive max pooling, passes it through a Sigmoid function, and combines the result with the twice-upsampled feature map to produce the module's output feature map; the spatial attention upsampling module (USA) first upsamples by a factor of two, concatenates the channel-wise mean and maximum of the feature map along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3 followed by a Sigmoid function, and combines the result with the twice-upsampled feature map to produce the module's output feature map; the pixel attention upsampling module (UPA) first upsamples by a factor of two, applies a two-dimensional convolution with kernel 3 and a two-dimensional convolution with kernel 1, passes the kernel-1 convolution output through a Sigmoid function to obtain a weight, and applies kernel-1 and kernel-3 convolutions to the weighted features to produce the module's output feature map.
S224: constructing the custom scale-aware module SA. The input is x, whose channel count is recorded as c. x is fed into four parallel branches that extract features in different ways: the first branch uses a convolution kernel of size 3 with a dilation rate of 1, giving f1; the second, third and fourth branches use different dilation rates, giving f2, f3 and f4. f1 is combined with f2 of the next branch, a two-dimensional convolution reduces the channels to c, and a pixel attention module yields y1; y1 is combined with f3, a two-dimensional convolution reduces the channels to c, and the pixel attention module yields y2; y2 is combined with f4, a two-dimensional convolution reduces the channels to c, and the pixel attention module yields y3. Finally, y1, y2 and y3 are concatenated along the channel dimension and a two-dimensional convolution reduces the channels from 3c to c, giving the output y.
Further, the step S3 includes:
s31: setting the loss function and parameters: the loss function is the MSE (mean squared error); the Adam optimizer is used, the batch size is set to 1, the learning rate to 0.00001, and the number of epochs to 1000;
s32: inputting the images and their Gaussian-processed density maps into the neural network for training;
s33: loading the trained network parameters, computing the evaluation metrics MAE and MSE on the test set, and estimating the network performance.
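Steps s31 to s33 can be sketched as below. This is a hedged illustration: `net` is only a placeholder for the full multi-level attention scale perception network, the random tensors stand in for a real image and its density-map label, and reporting MSE as the root mean squared error follows the usual crowd-counting convention (an assumption; the patent only names "mae and mse").

```python
import torch
import torch.nn as nn

# s31 setup: MSE loss on density maps, Adam optimizer, lr 1e-5, batch size 1.
net = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # placeholder model
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-5)

# s32 training step on one image (batch size 1); the patent runs 1000 epochs.
image = torch.randn(1, 3, 64, 64)    # stand-in training image
target = torch.rand(1, 1, 64, 64)    # stand-in Gaussian density-map label
for _ in range(2):
    optimizer.zero_grad()
    loss = criterion(net(image), target)
    loss.backward()
    optimizer.step()

# s33 evaluation over per-image predicted vs. ground-truth head counts.
def mae_mse(pred_counts, gt_counts):
    """Mean absolute error and root mean squared error of the counts."""
    pred = torch.as_tensor(pred_counts, dtype=torch.float64)
    gt = torch.as_tensor(gt_counts, dtype=torch.float64)
    return (pred - gt).abs().mean().item(), ((pred - gt) ** 2).mean().sqrt().item()
```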
Further, the step S4 includes:
s41: processing the picture acquired by the camera to at most 768 × 1024 pixels, obtaining a processed picture;
s42: inputting the processed picture into the trained neural network to obtain the corresponding predicted density map and predicted head count y.
Specifically, as shown in fig. 1, the present invention provides a crowd counting method based on multi-level attention scale perception, including:
s1: acquiring a data set and preprocessing the data set;
s2: constructing a multi-level attention scale perception neural network backbone;
s3: debugging and training a multi-level attention scale perception neural network and testing the effectiveness of the network;
s4: and acquiring a camera image, inputting the trained neural network for testing, and acquiring a predicted density map and the predicted number of people of the image.
As shown in fig. 2, the present invention provides a crowd counting method based on multi-level attention scale perception, further elaborating the details of the multi-level attention scale perception neural network structure, including:
1: The front-end network extracts features. The first ten convolution layers of VGG16 are used as feature extraction layers, each a Conv2d convolution with kernel=3 followed by a ReLU activation function, with channel counts 64, 64, 128, 128, maxpooling(kernel=2), 256, 256, 256, maxpooling(kernel=2), 512, 512, 512. Depth features are extracted with this structure.
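A sketch of this front-end in PyTorch, under the assumption that the layer list above is read as channel counts with pooling where indicated; copying in the VGG16 pre-trained weights is omitted here.

```python
import torch
import torch.nn as nn

def make_frontend():
    """Build the ten VGG16 convolution layers listed above, with ReLU after
    every convolution and 2x2 max-pooling where the text places it."""
    cfg = [64, 64, 128, 128, "M", 256, 256, 256, "M", 512, 512, 512]
    layers, in_ch = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)
```

With two pooling stages, the encoder downsamples the input by a factor of 4 and outputs 512 channels, matching the 512-channel input of the decoder's first SA module.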
2: back-end network design.
3: Multi-level attention scale-aware neural network.
As shown in fig. 3 to 6, the present invention provides a crowd counting method based on multi-level attention scale perception, further elaborating the multi-level attention scale perception module therein, including:
1: A pixel attention module is constructed. A two-dimensional convolution whose input channels equal its output channels, with kernel=1, is applied to the input feature map in, followed by a sigmoid function, obtaining out; the final output is the element-wise product of in and out, plus in. In this way a weight parameter is attached to every pixel, improving precision.
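A minimal PyTorch sketch of this pixel attention module (the class name `PixelAttention` is ours):

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """1x1 conv -> sigmoid gives a per-pixel weight map; the residual add
    keeps the original signal: output = in * out + in."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        w = torch.sigmoid(self.conv(x))
        return x * w + x
```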
2: The custom multi-level attention modules are constructed. The channel attention upsampling module (UGA) first upsamples by a factor of two, takes the sum of adaptive average pooling and adaptive max pooling, passes it through a Sigmoid function, and combines the result with the twice-upsampled feature map to produce the module's output feature map. The spatial attention upsampling module (USA) first upsamples by a factor of two, concatenates the channel-wise mean and maximum of the feature map along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3 followed by a Sigmoid function, and combines the result with the twice-upsampled feature map to produce the module's output feature map. The pixel attention upsampling module (UPA) first upsamples by a factor of two, applies a two-dimensional convolution with kernel 3 and a two-dimensional convolution with kernel 1, passes the kernel-1 convolution output through a Sigmoid function to obtain a weight, and applies kernel-1 and kernel-3 convolutions to the weighted features to produce the module's output feature map.
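The three upsampling attention modules can be sketched as follows. The text is ambiguous about whether the attention map is added to or multiplied with the upsampled features; element-wise multiplication, the common attention convention, is assumed here, as is nearest-neighbour upsampling, and the UPA wiring is a simplified reading of the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UGA(nn.Module):
    """Channel attention upsampling: 2x upsample, then channel weights from
    the sigmoid of (adaptive avg pool + adaptive max pool)."""
    def forward(self, x):
        up = F.interpolate(x, scale_factor=2, mode="nearest")
        w = torch.sigmoid(F.adaptive_avg_pool2d(up, 1) + F.adaptive_max_pool2d(up, 1))
        return up * w

class USA(nn.Module):
    """Spatial attention upsampling: 2x upsample, concat channel-wise mean
    and max, conv with kernel 7 and dilation 3 (padding 9 keeps the size)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, dilation=3, padding=9)
    def forward(self, x):
        up = F.interpolate(x, scale_factor=2, mode="nearest")
        stat = torch.cat([up.mean(dim=1, keepdim=True),
                          up.max(dim=1, keepdim=True).values], dim=1)
        return up * torch.sigmoid(self.conv(stat))

class UPA(nn.Module):
    """Pixel attention upsampling: 2x upsample, kernel-3 then kernel-1
    convolutions, sigmoid weight, reweighted features as output."""
    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.weight = nn.Conv2d(channels, channels, kernel_size=1)
    def forward(self, x):
        up = F.interpolate(x, scale_factor=2, mode="nearest")
        feat = self.conv1(self.conv3(up))
        return feat * torch.sigmoid(self.weight(feat))
```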
3: The custom scale-aware module SA is constructed. The input is x, whose channel count is recorded as c. x is fed into four parallel branches that extract features in different ways: the first branch uses a convolution kernel of size 3 with a dilation rate of 1, giving f1; the second, third and fourth branches use different dilation rates, giving f2, f3 and f4. f1 is combined with f2 of the next branch, a two-dimensional convolution reduces the channels to c, and a pixel attention module yields y1; y1 is combined with f3, a two-dimensional convolution reduces the channels to c, and the pixel attention module yields y2; y2 is combined with f4, a two-dimensional convolution reduces the channels to c, and the pixel attention module yields y3. Finally, y1, y2 and y3 are concatenated along the channel dimension and a two-dimensional convolution reduces the channels from 3c to c, giving the output y.
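A sketch of the SA module under stated assumptions: the patent fixes only the first branch's dilation rate at 1 and says the other branches use different dilated convolutions, so the rates 2, 3 and 4 below are illustrative, and the small `PA` class restates the pixel attention module for self-containment.

```python
import torch
import torch.nn as nn

class PA(nn.Module):
    """Pixel attention: 1x1 conv -> sigmoid -> reweight, plus residual."""
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Conv2d(c, c, kernel_size=1)
    def forward(self, x):
        return x * torch.sigmoid(self.conv(x)) + x

class SA(nn.Module):
    """Scale-aware module sketch: four parallel dilated-conv branches fused
    pairwise through convolutions and pixel attention, then concatenated."""
    def __init__(self, c_in, c):
        super().__init__()
        # dilation rates (1, 2, 3, 4) are an assumption; padding=d keeps size
        self.branches = nn.ModuleList(
            nn.Conv2d(c_in, c, 3, padding=d, dilation=d) for d in (1, 2, 3, 4))
        self.fuse12 = nn.Conv2d(2 * c, c, 3, padding=1)
        self.fuse23 = nn.Conv2d(2 * c, c, 3, padding=1)
        self.fuse34 = nn.Conv2d(2 * c, c, 3, padding=1)
        self.pa = PA(c)
        self.out = nn.Conv2d(3 * c, c, 3, padding=1)  # 3c -> c
    def forward(self, x):
        f1, f2, f3, f4 = (b(x) for b in self.branches)
        y1 = self.pa(self.fuse12(torch.cat([f1, f2], dim=1)))
        y2 = self.pa(self.fuse23(torch.cat([y1, f3], dim=1)))
        y3 = self.pa(self.fuse34(torch.cat([y2, f4], dim=1)))
        return self.out(torch.cat([y1, y2, y3], dim=1))
```

The progressive fusion lets features from small receptive fields inform the larger-dilation branches, which is what gives the module its multi-scale character.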
As shown in fig. 2, the present invention provides a people counting method based on multi-level attention scale perception, and its application part is further described. Image data is acquired with a camera and processed to at most 768 × 1024 pixels; if it is a grayscale image it is converted to a three-channel RGB image; the trained network and its parameters are loaded, and the image is input to obtain the predicted head count.
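This application step can be sketched as below; `model` is assumed to be the trained network returning a single-channel density map, and the padding to multiples of 16 mirrors the preprocessing of s13.

```python
import torch
import torch.nn.functional as F

def predict_count(model, image):
    """Pad an RGB tensor (1, 3, H, W) so H and W are multiples of 16, run
    the trained network, and integrate the predicted density map to get
    the head count."""
    _, _, h, w = image.shape
    pad_h = (16 - h % 16) % 16
    pad_w = (16 - w % 16) % 16
    image = F.pad(image, (0, pad_w, 0, pad_h))  # pad right and bottom edges
    with torch.no_grad():
        density = model(image)
    return density, float(density.sum())
```

Summing the density map is what turns the regression output into the reported head count y.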
The invention can be used in people-flow detection systems for large gathering places, densely visited tourist sites, shopping malls and the like. It predicts the number of people in the current picture from a single picture, and is especially accurate under densely populated conditions.
Compared with the prior art, the invention has the beneficial effects that:
1: The invention can perform more accurate crowd density and count estimation for large-scale crowds;
2: The structure of a classical convolutional neural network is improved: simple convolutional layers are replaced with multi-level attention modules and a custom scale-aware module, and an Adam optimizer optimizes the network's initial weights and thresholds, accelerating convergence, bringing the network close to its optimal parameters, and strengthening the network's extraction of different features;
3: On top of the features extracted by the first ten layers of VGG16, the custom scale-aware module further extracts feature information at different spatial scales, raising the network's attention to dense crowds and overcoming the incompleteness of single-scale feature extraction. Multi-level attention increases the weights of effective features at different scales, weakens the background weight, and improves the regression performance.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (6)

1. A crowd counting method based on multi-level attention scale perception is characterized by comprising the following steps:
s1: acquiring a data set and preprocessing the data set to obtain a density map of a training set and a density map of a testing set;
s2: constructing a backbone of a multi-level attention scale perception neural network;
s3: debugging and training the multi-level attention scale perception neural network and testing the effectiveness of the network based on the training set, the testing set and the backbone of the multi-level attention scale perception neural network to obtain the trained neural network;
s4: and acquiring a camera image, inputting the trained neural network for testing, and acquiring a predicted density map and the predicted number of people of the image.
2. The method for crowd counting based on multi-level attention scale perception according to claim 1, wherein the step S1 comprises:
s11: downloading a public data set, and dividing the public data set into a training set and a testing set;
s12: carrying out data enhancement on the training set and the test set by flipping each image horizontally, doubling the data volume, to obtain the image data of the training set and of the test set respectively;
s13: padding the width and height of the training-set and test-set images to multiples of 16, and adjusting the positions in the head-location maps proportionally, to obtain the location map of the training-set labels and the location map of the test-set labels;
s14: processing the location map of the training-set labels into the density map of the training set using a Gaussian kernel function with a Gaussian kernel size of 15, and likewise processing the location map of the test-set labels into the density map of the test set.
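The density-map construction of step S14 can be sketched as follows. This is a hedged illustration: the claim fixes the Gaussian kernel size at 15, so the filter below is truncated to a 15×15 window; the sigma value and the helper name `locations_to_density` are assumptions, not part of the claim.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def locations_to_density(points, height, width, sigma=4.0):
    """Place a unit impulse at each annotated head position, then blur it.

    points: iterable of (row, col) head coordinates from the location map.
    Returns a density map whose integral equals the number of heads.
    """
    density = np.zeros((height, width), dtype=np.float64)
    for r, c in points:
        if 0 <= r < height and 0 <= c < width:
            density[int(r), int(c)] += 1.0
    # truncate=7/sigma makes the effective window 15x15 (radius 7),
    # matching the kernel size of 15 specified in step S14
    return gaussian_filter(density, sigma=sigma, truncate=7.0 / sigma)

density = locations_to_density([(10, 10), (30, 40), (50, 20)], 64, 64)
```

Because Gaussian blurring preserves total mass, summing the density map recovers the annotated head count, which is what makes density-map regression equivalent to counting.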
3. The method for crowd counting based on multi-level attention scale perception according to claim 1, wherein the step S2 comprises:
s21: designing the structure of the feature-extraction encoder: the first ten layers of VGG16 are taken as the feature extraction layers, each a Conv2d convolution with kernel = 3 followed by a ReLU activation function; the layer widths are 64, 64, 128, 128, maxpooling (kernel = 2), 256, 256, maxpooling (kernel = 2), 512, 512, 512; depth features are extracted with this encoder structure, and the VGG16 pre-training parameters are loaded;
s22: designing the decoder that regresses the crowd density map and the number of people.
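A hedged PyTorch sketch of the encoder in step S21: each entry in the layer list is a 3×3 convolution followed by ReLU, with 2×2 max pooling where the claim lists "maxpooling (kernel = 2)". The layer list copies the claim as written; loading of the official VGG16 pre-trained weights is omitted here.

```python
import torch
import torch.nn as nn

def make_encoder():
    # layer widths as listed in step S21; "M" marks maxpooling (kernel = 2)
    cfg = [64, 64, 128, 128, "M", 256, 256, "M", 512, 512, 512]
    layers, in_ch = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)

encoder = make_encoder()
with torch.no_grad():
    out = encoder(torch.zeros(1, 3, 64, 64))  # two maxpoolings: 64 -> 16
```

With two pooling layers the encoder downsamples by a factor of 4 and outputs 512-channel feature maps, matching the 512 input channels of the decoder's first scale-aware module in step S221.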
4. The method according to claim 3, wherein the step S22 comprises:
s221: the back-end backbone network is SA-UGA-SA-UPA-SA-USA: a scale-aware module SA with kernel = 3, 512 input channels and 128 output channels, followed by a ReLU activation function; a custom channel attention upsampling module UGA; a two-dimensional convolution with kernel = 3, 128 input channels and 64 output channels, followed by a ReLU activation function; a custom spatial attention upsampling module USA; a two-dimensional convolution with kernel = 3, 64 input channels and 16 output channels, followed by a ReLU activation function; a custom pixel attention upsampling module UPA; the last layer is a full convolution with 16 input channels, 1 output channel and kernel = 1, followed by a ReLU activation function; a residual structure is added between the SA modules, and the prediction density map is finally output;
s222: constructing a pixel attention module: the input image in undergoes a two-dimensional convolution whose input channels equal its output channels with kernel = 1, followed by a sigmoid function, giving out; the final output is the element-wise product of in and out, plus in; in this way a weight parameter is added for each pixel;
s223: constructing the custom multi-level attention modules: the channel attention upsampling module (UGA) performs twofold upsampling, takes the sum of adaptive average pooling and adaptive max pooling, passes it through a Sigmoid function, and applies the result to the twofold-upsampled feature map to give the module's output feature map; the spatial attention upsampling module (USA) performs twofold upsampling, concatenates the channel-wise average and maximum of the feature map along the channel dimension, applies a convolution with kernel size 7 and dilation rate 3 followed by a Sigmoid function, and applies the result to the twofold-upsampled feature map to give the module's output feature map; the pixel attention upsampling module (UPA) performs twofold upsampling followed by a two-dimensional convolution with kernel 3, then a two-dimensional convolution with kernel 1 whose Sigmoid output provides per-pixel weights; the weighted features pass through convolutions with kernel 1 and kernel 3 to give the module's output feature map.
S224: constructing the custom scale-aware module SA: the input is x, and its channel count c is recorded; the input is fed into four parallel branches to extract features: the first branch uses a convolution with kernel size 3 and dilation rate 1, giving f1; the second branch uses a different dilated convolution, giving f2; the third branch uses a different dilated convolution, giving f3; the fourth branch uses a different dilated convolution, giving f4. f1 is concatenated with f2 from the next branch, a two-dimensional convolution restores the channel count to c, and a pixel attention module gives y1; y1 is concatenated with f3, a two-dimensional convolution restores the channel count to c, and a pixel attention module gives y2; y2 is concatenated with f4, a two-dimensional convolution restores the channel count to c, and a pixel attention module gives y3; y1, y2 and y3 are concatenated along the channel dimension, and a two-dimensional convolution reduces the channel count from 3c to c, giving the output y.
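The two custom modules of steps S222 and S224 can be sketched together in PyTorch. This is a hedged illustration: the class names, the specific dilation rates (1, 2, 3, 4), and the use of 1×1 convolutions for the channel-restoring steps are assumptions; the claim only specifies that each branch uses a different dilated convolution and that the channel count returns to c after each concatenation.

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """Step S222: per-pixel gate via a 1x1 convolution with equal
    input/output channels, a sigmoid, and an identity added back."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x)) + x

class ScaleAware(nn.Module):
    """Step S224 sketch: four parallel 3x3 dilated-convolution branches
    fused progressively through pixel attention modules."""
    def __init__(self, c_in, c_out):
        super().__init__()
        # assumed dilation rates 1..4; padding=d keeps spatial size
        self.branches = nn.ModuleList(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 3, 4))
        # 2-D convolutions restoring the channel count after each concat
        self.reduce = nn.ModuleList(
            nn.Conv2d(2 * c_out, c_out, kernel_size=1) for _ in range(3))
        self.pa = nn.ModuleList(PixelAttention(c_out) for _ in range(3))
        self.fuse = nn.Conv2d(3 * c_out, c_out, kernel_size=1)  # 3c -> c

    def forward(self, x):
        f1, f2, f3, f4 = (b(x) for b in self.branches)
        y1 = self.pa[0](self.reduce[0](torch.cat([f1, f2], dim=1)))
        y2 = self.pa[1](self.reduce[1](torch.cat([y1, f3], dim=1)))
        y3 = self.pa[2](self.reduce[2](torch.cat([y2, f4], dim=1)))
        return self.fuse(torch.cat([y1, y2, y3], dim=1))

torch.manual_seed(0)
sa = ScaleAware(8, 8)
with torch.no_grad():
    y = sa(torch.randn(1, 8, 16, 16))
```

Chaining the branches through pixel attention lets features from smaller receptive fields gate those from larger ones, which is how the module mixes scale information rather than simply concatenating the four branches.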
5. The method for crowd counting based on multi-level attention scale perception according to claim 1, wherein the step S3 comprises:
s31: setting the loss function and parameters: the loss function uses MSE (mean squared error) with an Adam optimizer; the batch size is set to 1, the learning rate to 0.00001, and the number of epochs to 1000;
s32: inputting the processed images and their Gaussian density maps into the neural network for training;
s33: loading the trained network parameters, testing with the evaluation functions MAE and MSE on the test set, and estimating the network performance.
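The training setup of step S31 can be sketched as follows: MSE loss between the predicted and ground-truth density maps, an Adam optimizer with learning rate 1e-5, and a batch size of 1. The tiny stand-in model is illustrative only; the real network is the encoder/decoder of claims 3 and 4, and the loop runs three iterations here rather than the 1000 epochs the claim specifies.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # stand-in for the full network
criterion = nn.MSELoss()                           # MSE loss, per step S31
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

image = torch.rand(1, 3, 32, 32)    # batch size 1, per step S31
target = torch.rand(1, 1, 32, 32)   # ground-truth Gaussian density map

losses = []
for _ in range(3):                  # the claim trains for 1000 epochs
    optimizer.zero_grad()
    loss = criterion(model(image), target)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

The MAE and MSE evaluation of step S33 would then compare the summed predicted density map against the annotated count for each test image.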
6. The method for crowd counting based on multi-level attention scale perception according to claim 1, wherein the step S4 comprises:
s41: processing the picture acquired by the camera so that it is smaller than 768 × 1024 pixels, obtaining a processed picture;
s42: inputting the processed picture into the trained neural network to obtain the corresponding predicted density map and the predicted number of people y.
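The inference of claim 6 reduces to two small operations: capping the input image size (step S41) and summing the predicted density map, which integrates to the head count (step S42). The sketch below uses a synthetic density map and assumed helper names; the actual density map would come from the trained network.

```python
import numpy as np

def cap_size(h, w, max_h=768, max_w=1024):
    """Step S41: scale factor so the image fits within max_h x max_w."""
    scale = min(1.0, max_h / h, max_w / w)
    return int(h * scale), int(w * scale)

def predicted_count(density_map):
    """Step S42: the density map integrates to the estimated head count."""
    return float(np.sum(density_map))

new_size = cap_size(1536, 2048)                       # camera frame too large
density_map = np.full((96, 128), 42.0 / (96 * 128))   # synthetic map summing to 42
count = predicted_count(density_map)
```

In practice the padded/resized image would also be adjusted to a multiple of 16 on each side, as in the preprocessing of step S13, so the decoder's upsampling stages align with the input resolution.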
CN202110605990.1A 2021-05-31 2021-05-31 Multistage attention scale perception crowd counting method Active CN113283356B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110605990.1A CN113283356B (en) 2021-05-31 2021-05-31 Multistage attention scale perception crowd counting method


Publications (2)

Publication Number Publication Date
CN113283356A (en) 2021-08-20
CN113283356B CN113283356B (en) 2024-04-05

Family

ID=77282919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110605990.1A Active CN113283356B (en) 2021-05-31 2021-05-31 Multistage attention scale perception crowd counting method

Country Status (1)

Country Link
CN (1) CN113283356B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions
WO2020169043A1 (en) * 2019-02-21 2020-08-27 苏州大学 Dense crowd counting method, apparatus and device, and storage medium
CN112132023A (en) * 2020-09-22 2020-12-25 上海应用技术大学 Crowd counting method based on multi-scale context enhanced network
CN112597964A (en) * 2020-12-30 2021-04-02 上海应用技术大学 Method for counting layered multi-scale crowd
CN112668532A (en) * 2021-01-05 2021-04-16 重庆大学 Crowd counting method based on multi-stage mixed attention network


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
马骞 (Ma Qian): "Research on crowd density estimation algorithms based on a channel-domain attention mechanism", Electronic Design Engineering, no. 15 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880588A (en) * 2021-09-13 2023-03-31 国家电网有限公司 Two-stage unmanned aerial vehicle detection method combined with time domain
CN114399728A (en) * 2021-12-17 2022-04-26 燕山大学 Method for counting crowds in foggy day scene
CN114399728B (en) * 2021-12-17 2023-12-05 燕山大学 Foggy scene crowd counting method
CN114511636A (en) * 2022-04-20 2022-05-17 科大天工智能装备技术(天津)有限公司 Fruit counting method and system based on double-filtering attention module
CN117253184A (en) * 2023-08-25 2023-12-19 燕山大学 Foggy day image crowd counting method guided by foggy priori frequency domain attention characterization
CN117253184B (en) * 2023-08-25 2024-05-17 燕山大学 Foggy day image crowd counting method guided by foggy priori frequency domain attention characterization



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant