CN115457464B - Crowd counting method based on transformer and CNN - Google Patents
- Publication number: CN115457464B
- Application number: CN202211084706.1A
- Authority: CN (China)
- Prior art keywords: layer, crowd, feature, convolution, scale
- Prior art date: 2022-09-06
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06V10/42—Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
- G06V10/82—Image or video recognition or understanding using pattern recognition or machine learning with neural networks
- G06T2207/10024—Color image
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- G06T2207/30242—Counting objects in image
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Abstract
The invention discloses a crowd counting method based on a transformer and a CNN, comprising the following steps: obtaining training samples and applying preprocessing and augmentation; inputting the augmented RGB images into the model backbone network to obtain global feature maps at different resolutions; upsampling the global feature maps at different resolutions and concatenating them in the channel dimension to obtain an aggregated feature map; inputting the aggregated feature map into a multi-branch convolutional neural network to obtain multi-scale feature maps, which are summed in the channel dimension to obtain a multi-scale aggregated feature map; inputting the multi-scale aggregated feature map into a density map regression layer for smoothing and dimension reduction to output a density map; training with an optimal transport loss, and finally performing prediction. Combining a pyramid transformer with a multi-branch convolutional neural network enlarges the receptive field of the model, effectively reduces the influence of scale variability, and improves prediction accuracy.
Description
Technical Field
The invention relates to a crowd counting method based on a transformer and a CNN, and belongs to the field of image processing.
Background
With the rapid development of society and the continuous improvement of living standards, crowd-gathering scenes are increasingly common, especially in public places such as large venues, transportation hubs, and shopping malls, and the safety requirements of these scenes keep rising. Knowing the number of people in such scenes for safety management and emergency evacuation has therefore received extensive attention from researchers. Crowd density estimation estimates the density of the crowd in a given scene, generates a density map carrying distribution information, and gives the total number of people.
At present, crowd density estimation mainly uses three kinds of methods: detection-based, regression-based, and density map-based. Detection-based methods count pedestrian targets with manually designed window detectors and perform poorly in crowd-dense areas. Regression-based methods regress the head count directly and lack crowd spatial information. Compared with the first two, density map-based methods output a crowd density map alongside the total count, providing key information about the spatial distribution of the crowd; they adapt reasonably well to dense areas, increase counting accuracy, and reduce the difficulty of method design. Current crowd density estimation methods are therefore mainly density map-based.
Most existing density map-based crowd density estimation methods use a convolutional neural network model from deep learning for density map regression. However, crowd targets suffer from severe scale variability, and the receptive field of a convolutional neural network model is limited, so multi-scale features cannot be captured effectively, which ultimately reduces the accuracy of the counting model. How to increase the receptive field of the model and reduce the influence of scale variability on counting accuracy has therefore become a difficult problem to be solved.
Disclosure of Invention
Aiming at the problems of variable scale in crowd scenes and the limited receptive field of convolutional neural network models, the invention provides a crowd counting method based on a transformer and a CNN. A pyramid transformer is combined with a convolutional neural network: the pyramid transformer learns global features of the image, giving the model a global receptive field and reducing the influence of scale variability on accuracy; meanwhile, multi-scale feature learning and enhancement are performed through a multi-branch convolutional neural network, enriching the model's feature representation and providing locality and inductive bias, so that density map regression is finally performed accurately.
In order to solve the above technical problems, the invention adopts the following technical scheme:
A crowd counting method based on a transformer and a CNN comprises the following steps:
(1) Obtain training samples: acquire a large number of crowd RGB images in multiple scenes, then acquire crowd annotations by marking a pixel point at each head position, the number of pixel points representing the total number of people in the scene. The images are then augmented, randomly flipped horizontally or vertically, and standardized, and the training images are cropped to 256×256 for training.
(2) Input the augmented crowd RGB images into the backbone network of the model to compute global feature maps at different resolutions. The backbone network is a pyramid transformer composed of four stages, each comprising an overlapping image block embedding layer and an encoder.
Further, in the overlapping image block embedding layer, the input image is divided into mutually overlapping image blocks by one convolution layer; the convolution operation outputs a two-dimensional feature map, which is unfolded into a one-dimensional vector and regularized as the encoder input. The convolution kernel of the first stage's overlapping image block embedding layer is 7×7 with a stride of 4; the convolution kernels of the other three stages' overlapping image block embedding layers are 3×3 with a stride of 2. The output dimensions of the four convolution layers are 64, 128, 320, and 512. Pyramid-style feature maps at different resolutions are output by controlling the stride of the convolution layers.
Further, in the encoder, the input vector undergoes self-attention calculation through a number of blocks, each comprising a self-attention calculation layer and a forward propagation layer connected by skip connections. The numbers of blocks in the four stages are 3, 8, 27, and 3; the numbers of heads of the multi-head self-attention layers in the four stages are 1, 2, 5, and 8 respectively. The vector computed by the encoder is reshaped into a two-dimensional feature map and serves as the input of the next stage. Finally, the four stages output four groups of global feature maps at different resolutions, which are 1/4, 1/8, 1/16, and 1/32 of the input augmented crowd RGB image resolution in sequence.
(3) First, upsample the four groups of global feature maps at different resolutions extracted in step (2) to the same resolution while keeping the channel counts unchanged: the feature maps of the last three stages are upsampled by bilinear interpolation to the resolution of the first-stage feature map, i.e. 1/4 the size of the augmented crowd RGB image.
The feature maps of the four stages are then aggregated by concatenating all of them in the channel dimension; the total channel count is the sum of the four stages' channel counts, i.e. 64+128+320+512=1024, finally giving an aggregated feature map with 1024 channels.
(4) Input the aggregated feature map into the multi-branch convolutional neural network module of the network model to obtain multi-scale feature maps. The module comprises three branches, each containing a convolution layer; the first branch's convolution kernel is 3×3, the second's is 5×5, and the third's is 7×7. Each branch outputs 256 channels, and each branch's convolution layer is followed by a batch regularization layer and a ReLU activation function layer.
After calculation by the three branches, three groups of multi-scale feature maps with the same resolution and channel count are obtained. The multi-scale feature maps are then summed pixel-wise on corresponding channels, specifically by adding the pixels at corresponding positions of the three feature maps on each channel, finally giving a multi-scale aggregated feature map with 256 channels.
(5) Input the multi-scale aggregated feature map output in step (4) into the density map regression layer for smoothing and dimension reduction, and output the density map. The density map regression layer comprises two convolution layers: the first has a 3×3 kernel, a stride of 1, and 64 output channels; the second has a 1×1 kernel, a stride of 1, and 1 output channel. Each convolution layer is followed by a batch regularization layer and a ReLU activation function, and finally the crowd density estimation map and the crowd counting result are output.
(6) Train with the optimal transport loss, regressing the crowd density estimation map and the total count, optimize the model parameters, and save the model parameters with minimum loss; at prediction time, load the saved minimum-loss model parameters and directly obtain the crowd density estimation map and the crowd counting result as the prediction results.
By adopting the above technical scheme, the invention achieves the following technical progress:
1. The pyramid backbone network outputs feature maps at different resolutions, so that crowd targets at different scales have rich feature representations: high-resolution feature maps carry rich detail, which benefits the prediction of small-scale crowds, while low-resolution feature maps carry rich semantic information, which benefits the prediction of large-scale crowds. Aggregating feature maps at different resolutions improves the accuracy of crowd density map estimation.
2. The transformer used has a global receptive field and can model with all input pixels. Taking the transformer as the model backbone increases the receptive field of the model, overcomes the drawback that a traditional convolutional neural network model has only a local receptive field, effectively reduces the influence of scale variability in crowd scenes on the model, and improves accuracy.
3. A multi-branch convolutional neural network is adopted to learn the global feature maps with convolution kernels of different sizes, enriching the detail representations of features at different scales in the feature maps, further reducing the influence of scale variability, and finally completing the density map regression.
4. The method combines the transformer with the convolutional neural network so that the model has both a global receptive field and locality, avoiding the problem that a transformer alone, lacking the inductive bias of visual tasks, generalizes poorly and requires a large amount of training data, while also solving the problem of the convolutional neural network's limited receptive field; the method therefore adapts well to scenes with variable scales.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the overall network architecture of the present invention;
FIG. 3 is a schematic illustration of training of the present invention;
FIG. 4 is a schematic diagram of crowd density estimation using the present invention.
Detailed Description
The invention is further illustrated by the following examples:
FIG. 1 is a flow chart of the crowd density estimation method of the invention. As shown in FIG. 1, the method comprises the following steps:
(1) Training samples of crowd RGB images are acquired in large numbers from multiple scenes. At the same time, annotation data for the corresponding crowd RGB images are obtained by marking a pixel point on each target head: one marked pixel point represents one pedestrian, the sum of the pixel points is the total number of people, and the horizontal and vertical coordinates of the marked points are finally obtained. The training samples are then augmented with random horizontal and vertical flipping, and standardized. At training time, images are randomly cropped to 256×256 image blocks for training.
The crowd RGB image consists of three channels, R (red), G (green), and B (blue). The image is standardized in the channel dimension by constraining the mean and standard deviation of the three channels, giving the augmented crowd RGB image. The standardization is:

C' = (C - mean) / std (1)

where C is the channel data, C' is the standardized channel data, mean is the channel mean, and std is the channel standard deviation. The augmented crowd RGB image is then input to the pyramid transformer backbone of the model as the prediction input.
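As an illustration of this preprocessing step, the following is a minimal sketch using PyTorch/torchvision. The mean and standard deviation values are the commonly used ImageNet statistics and are an assumption here, since the description only specifies per-channel standardization; in practice the annotation points must also be transformed with the same flips and crop as the image.

```python
import torchvision.transforms as T

# Minimal preprocessing sketch for step (1). The mean/std values are the
# common ImageNet statistics (an assumption; the patent only specifies
# per-channel standardization C' = (C - mean) / std).
train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),   # random horizontal flip
    T.RandomVerticalFlip(p=0.5),     # random vertical flip
    T.RandomCrop(256),               # crop training images to 256x256
    T.ToTensor(),                    # HWC uint8 -> CHW float in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
```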
(2) FIG. 2 is a schematic diagram of the overall network structure of the invention, according to which the network is built. The pyramid transformer backbone consists of four stages; the augmented crowd RGB image passes through the four stages of the backbone in sequence to obtain feature maps at different resolutions, the output feature map resolution of each stage being 1/4, 1/8, 1/16, and 1/32 of the augmented crowd RGB image resolution in sequence.
Each stage of the backbone network includes an overlapping image block embedding layer and an encoder. In each stage, the input feature map first passes through the image block embedding layer, which maps the two-dimensional feature map into one-dimensional vectors; the one-dimensional vectors undergo self-attention calculation in the encoder and are reshaped into a two-dimensional global feature map, which serves as the input of the next stage.
The structure and steps of each stage are described in detail as follows:
The overlapping image block embedding layer of the first stage comprises a convolution layer and a regularization layer. Specifically, the convolution kernel of the convolution layer is 7×7 with a stride of 4, 3 input channels, and 64 output channels. Because the stride is greater than 1, the resolution of the augmented crowd RGB image is reduced to 1/4 of the input after the convolution layer; because the stride is smaller than the kernel size, the convolution kernel overlaps adjacent image blocks, increasing local information interaction. The output two-dimensional feature map is then flattened into a one-dimensional vector, preserving all pixel information while meeting the computation requirements. The flattened vector is input to a regularization layer and regularized in the channel dimension, which aids model convergence, and the regularized vector is input to the encoder (a sketch of this embedding layer is given below). The first-stage encoder comprises 3 blocks, each consisting of a multi-head self-attention layer and a forward propagation layer, connected by skip connections. The self-attention calculation layer comprises a regularization layer, which regularizes the input vector in the channel dimension, and a multi-head self-attention layer. The first-stage multi-head self-attention layer has 1 head and an input vector dimension of 64; the computed vector is input to the forward propagation layer.
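A minimal PyTorch sketch of this overlapping image block embedding layer follows; the class and argument names are illustrative, not the patented implementation:

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Overlapping patch embedding sketch: a strided convolution whose kernel
    is larger than its stride, so neighbouring image blocks overlap, followed
    by LayerNorm over the channel dimension (stage 1: 7x7 kernel, stride 4)."""
    def __init__(self, in_ch=3, embed_dim=64, kernel=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=kernel,
                              stride=stride, padding=kernel // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x)                    # (B, C, H/4, W/4)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)    # flatten to a token sequence (B, HW, C)
        x = self.norm(x)                    # regularize in the channel dimension
        return x, H, W

# Usage: a 256x256 crop yields 64x64 = 4096 tokens of dimension 64
tokens, H, W = OverlapPatchEmbed()(torch.randn(1, 3, 256, 256))
```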
The self-attention calculation is:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V (2)

where Q, K, and V denote the query matrix, key matrix, and value matrix, obtained by multiplying the input vector with the weight matrices W_Q, W_K, and W_V respectively, and d_k is the dimension of the input vector of the multi-head self-attention calculation.
The softmax is calculated as:

softmax(x_i) = exp(x_i) / Σ_j exp(x_j) (3)
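Equations (2) and (3) can be illustrated with the following single-head sketch (the model itself uses multi-head self-attention; the names are illustrative):

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head version of Eq. (2)-(3): Q, K, V are obtained by multiplying
    the input tokens with learned weight matrices, and the softmax-normalized
    similarity Q K^T / sqrt(d_k) weights the value vectors."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v               # (B, N, d_k) each
    d_k = q.size(-1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return attn @ v                                   # (B, N, d_k)

# Usage: 4096 tokens of dimension 64, as produced by the stage-1 embedding
x = torch.randn(1, 4096, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                # torch.Size([1, 4096, 64])
```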
the forward propagation layer consists of two fully connected layers and one deep convolutional layer. The output dimension of the first full-connection layer is 8 times of the input dimension, the output dimension of the second full-connection layer is the dimension of the input vector of the forward propagation layer, a deep convolution layer is arranged between the two full-connection layers, the deep convolution layer firstly reshapes the vector into a two-dimensional feature map, then the two-dimensional feature map is divided into groups with the same dimension as the channel dimension, independent convolution is carried out on each group to be used as position coding of the feature map, the convolution kernel size is 3 multiplied by 3, the step size is 1, the input channel and the output channel are the same, and the number of channels of the input vector is the same. After the calculation is completed, the two-dimensional feature map is flattened into a vector to be used as the output of the second full connection layer. After the first full connection layer and the deep convolution layer, the full connection layer and the deep convolution layer are activated through a GELU activation function, wherein the activation function is calculated in the following way:
GELU(x)=xP(X≤x) (4)
where x is the input vector and P is the cumulative distribution of the gaussian distribution. The vector passing through the forward propagation layer is then reshaped into a two-dimensional feature map as input to the second stage.
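A minimal sketch of this forward propagation layer, assuming the stage-1 dimensions (64-dimensional tokens, an expansion ratio of 8); setting groups equal to the channel count makes the 3×3 convolution depth-wise, i.e. each channel is convolved independently:

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    """Forward propagation layer sketch: FC -> depth-wise 3x3 convolution
    (acting as a positional encoding) -> FC, with GELU after the first FC
    and after the depth-wise convolution, matching the description above."""
    def __init__(self, dim=64, ratio=8):
        super().__init__()
        hidden = dim * ratio
        self.fc1 = nn.Linear(dim, hidden)
        # groups=hidden => one filter per channel (depth-wise convolution)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):
        x = self.act(self.fc1(x))                     # (B, N, hidden), GELU after fc1
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)     # reshape tokens to a 2-D map
        x = self.act(self.dwconv(x))                  # GELU after the depth-wise conv
        x = x.flatten(2).transpose(1, 2)              # flatten back to tokens
        return self.fc2(x)                            # (B, N, dim)
```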
Further, the second, third, and fourth stages have structures similar to the first stage, differing only in certain layers. In the second, third, and fourth stages, the convolution kernel of the convolution layer in the overlapping image block embedding layer is 3×3 with a stride of 2, and the output dimensions are 128, 320, and 512 in sequence; the numbers of blocks in the encoders are 8, 27, and 3 in sequence; the numbers of heads of the multi-head self-attention layers are 2, 5, and 8 respectively; and the output channel count of the first fully connected layer in the forward propagation layer is 8 times, 4 times, and 4 times the input channel count in sequence. The outputs of the first, second, and third stages each serve as the input of the following stage. Finally, the four stages output four groups of two-dimensional global feature maps at different resolutions.
(3) The global feature maps of the four stages are upsampled by bilinear interpolation: the second-stage output feature map is upsampled by a factor of 2, the third-stage by a factor of 4, and the fourth-stage by a factor of 8, while the first-stage output feature map keeps its size.

Further, once upsampled to the same size, the four stages' feature maps are concatenated in the channel dimension to obtain the aggregated feature map, whose channel count is the sum of the four stages' channel counts and whose resolution is 1/4 that of the augmented crowd RGB image.
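A sketch of this upsample-and-concatenate aggregation under the stage dimensions given above (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def aggregate(features):
    """Step (3) sketch: bilinearly upsample the stage-2/3/4 maps (x2, x4, x8)
    to the stage-1 resolution and concatenate along the channel dimension,
    giving 64 + 128 + 320 + 512 = 1024 channels."""
    target = features[0].shape[-2:]                    # stage-1 spatial size
    ups = [features[0]] + [
        F.interpolate(f, size=target, mode='bilinear', align_corners=False)
        for f in features[1:]
    ]
    return torch.cat(ups, dim=1)                       # (B, 1024, H/4, W/4)

# Usage with a 256x256 input: stage maps at 1/4, 1/8, 1/16, 1/32 resolution
feats = [torch.randn(1, c, 256 // s, 256 // s)
         for c, s in zip((64, 128, 320, 512), (4, 8, 16, 32))]
agg = aggregate(feats)   # torch.Size([1, 1024, 64, 64])
```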
(4) The aggregated feature map is input into the multi-branch convolutional neural network module to obtain multi-scale feature maps. The module consists of three branches, each comprising a convolution layer, a regularization layer, and a ReLU activation function. Specifically, the convolution kernel sizes of the convolution layers in the three branches are 3×3, 5×5, and 7×7 in sequence, and the number of output channels is 256. The ReLU activation function is calculated as:

ReLU(x) = max(0, x) (5)
further, after the aggregation feature map passes through three branches, three groups of multi-scale feature maps with the same resolution and channel number are output, then the multi-scale feature maps are added according to pixel positions in the channel dimension, specifically, on channels corresponding to the three groups of feature maps, pixels at the same positions are added, the output resolution is 1/4 of the enhanced crowd RGB image, the output channel is 256, and the multi-scale aggregation feature map is obtained.
(5) The multi-scale aggregated feature map is smoothed and reduced in dimension, and finally the crowd density estimation map and the crowd counting result are output. Smoothing and dimension reduction are performed by two convolution layers, each followed by a regularization layer and a ReLU activation function. The first convolution layer has a 3×3 kernel and 64 output channels; the second has a 1×1 kernel and 1 output channel. The smoothed, dimension-reduced feature map is the final crowd density estimation map, and summing all pixels of the density map yields the final crowd counting result.
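The density map regression layer and the final count can be sketched as follows (illustrative, assuming a 256-channel multi-scale aggregated feature map at 1/4 resolution):

```python
import torch
import torch.nn as nn

# Step (5) sketch: two convolutions (3x3 -> 64 channels, then 1x1 -> 1 channel),
# each followed by BatchNorm and ReLU, producing a single-channel density map.
regression_head = nn.Sequential(
    nn.Conv2d(256, 64, 3, stride=1, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 1, 1, stride=1), nn.BatchNorm2d(1), nn.ReLU(inplace=True),
)

density_map = regression_head(torch.randn(1, 256, 64, 64))
count = density_map.sum()   # summing all pixels gives the estimated head count
```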
(6) FIG. 3 is a schematic diagram of training the crowd density estimation method of the invention. The network is first trained; the loss function adopted by the invention is the optimal transport loss, as follows:

L = L_1 + L_OT (6)

L_1 = | ||S||_1 - ||S'||_1 | (7)

where L is the overall loss function of the network model; L_1 is the absolute error of the total count, which optimizes the people-counting result; L_OT is the optimal transport loss, which optimizes the distribution of the crowd density estimation map; S is the crowd annotation point map and ||S||_1 is its 1-norm; S' is the crowd density estimation map and ||S'||_1 is its 1-norm; and Φ denotes the Wasserstein distance computation from which L_OT is obtained.
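A sketch of the combined loss of Eq. (6)-(7); the optimal transport term is represented by a placeholder function, since the description names the Wasserstein distance Φ but a concrete OT solver (e.g. a Sinkhorn iteration) is an implementation choice not specified in this text:

```python
import torch

def counting_loss(pred_density, gt_point_map, ot_loss_fn):
    """Sketch of Eq. (6)-(7): total loss = absolute count error + OT term.
    `ot_loss_fn` is a placeholder for a Wasserstein/optimal transport
    implementation; the patent does not specify a particular solver."""
    l1 = (pred_density.sum() - gt_point_map.sum()).abs()   # | ||S'||_1 - ||S||_1 |
    return l1 + ot_loss_fn(pred_density, gt_point_map)
```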
During training, model parameters are optimized by back-propagation with gradient descent; a target threshold is set, and the model parameters with the minimum loss are saved, completing model training. FIG. 4 is a schematic diagram of crowd density estimation using the crowd density estimation method of the invention. At prediction time, the image to be predicted is input directly into the network without augmentation, and the crowd density estimation map and the total count are obtained as the final prediction results.
The above examples are only illustrative of preferred embodiments of the invention and are not intended to limit its scope; various modifications and improvements made by those skilled in the art to the technical solution of the invention without departing from its spirit shall fall within the scope of protection defined by the claims.
Claims (6)
1. A crowd counting method based on a transformer and a CNN, characterized in that it comprises the following steps:
(1) obtaining training samples: acquiring crowd RGB images of multiple scenes, and preprocessing and augmenting the crowd RGB images;
(2) inputting the augmented crowd RGB images into the backbone network of the model for calculation, the backbone network comprising a pyramid transformer composed of four stages, the augmented crowd RGB images passing through the four stages of the backbone network in sequence to obtain global feature maps at different resolutions, wherein each stage includes an overlapping image block embedding layer and an encoder;
(3) upsampling the global feature maps at different resolutions and then concatenating them in the channel dimension to obtain an aggregated feature map;
wherein step (3) specifically comprises:
first upsampling the four groups of global feature maps at different resolutions extracted in step (2) to the same resolution while keeping the channel counts unchanged, the feature maps of the last three stages being upsampled by bilinear interpolation to the resolution of the first-stage feature map, i.e. 1/4 the size of the augmented crowd RGB image;
then aggregating the feature maps of the four stages by concatenating all of them in the channel dimension, the total channel count being the sum of the channel counts of the four stages' feature maps, i.e. 64+128+320+512=1024, finally obtaining an aggregated feature map with 1024 channels;
(4) inputting the aggregated feature map into a multi-branch convolutional neural network to obtain multi-scale feature maps, and summing the multi-scale feature maps in the channel dimension to obtain a multi-scale aggregated feature map;
wherein step (4) specifically comprises:
the multi-branch convolutional neural network module comprises three branches, each comprising a convolution layer, with a 3×3 convolution kernel in the first branch, 5×5 in the second branch, and 7×7 in the third branch; each branch outputs 256 channels, and the convolution layer of each branch is followed by a batch regularization layer and a ReLU activation function layer;
after calculation by the three branches, three groups of multi-scale feature maps with the same resolution and channel count are obtained; the multi-scale feature maps are then summed pixel by pixel on corresponding channels, specifically by adding the pixels at corresponding positions of the three feature maps on each corresponding channel, finally obtaining the multi-scale aggregated feature map with 256 channels;
(5) inputting the multi-scale aggregated feature map into a density map regression layer for smoothing and dimension reduction, and outputting a density map;
(6) training with the optimal transport loss, and finally performing prediction.
2. The transformer and CNN based crowd counting method of claim 1, wherein: in step (1), before the preprocessing and augmentation, annotation data of the crowd RGB images are acquired, a pixel point being annotated at each head position, the number of pixel points representing the total number of people in the scene.
3. The transformer and CNN based crowd counting method of claim 1, wherein: the preprocessing and augmentation in step (1) specifically comprise random horizontal or vertical flipping and standardization, and during training the training images are cropped into 256×256 image blocks for training.
4. The transformer and CNN based crowd counting method of claim 1, wherein step (2) specifically comprises:
in the overlapping image block embedding layer, the input image is divided into mutually overlapping image blocks by one convolution layer, a two-dimensional feature map is output by the convolution operation, and the output two-dimensional feature map is unfolded into a one-dimensional vector and regularized as the encoder input; the convolution kernel of the convolution layer of the first stage's overlapping image block embedding layer is 7×7 with a stride of 4; the convolution kernels of the convolution layers in the other three stages' overlapping image block embedding layers are 3×3 with a stride of 2; the output dimensions of the four convolution layers are 64, 128, 320, and 512 in sequence; pyramid-style feature maps at different resolutions are output by controlling the stride of the convolution layers;
in the encoder, the input vector undergoes self-attention calculation through a number of blocks, each block comprising a self-attention calculation layer and a forward propagation layer connected by skip connections; the numbers of blocks in the four stages are 3, 8, 27, and 3 in sequence; the numbers of heads of the multi-head self-attention layers in the four stages are 1, 2, 5, and 8 respectively; the vector computed by the encoder is reshaped into a two-dimensional feature map and serves as the input of the next stage; finally the four stages output four groups of global feature maps at different resolutions, which are 1/4, 1/8, 1/16, and 1/32 of the input augmented crowd RGB image resolution in sequence.
5. The transformer and CNN based crowd counting method of claim 1, wherein step (5) specifically comprises:
the density map regression layer comprises two convolution layers: the first convolution layer has a 3×3 kernel, a stride of 1, and 64 output channels; the second convolution layer has a 1×1 kernel, a stride of 1, and 1 output channel; each convolution layer is followed by a batch regularization layer and a ReLU activation function; finally the crowd density estimation map and the crowd counting result are output.
6. The transformer and CNN based crowd counting method of claim 1, wherein step (6) specifically comprises:
training with the optimal transport loss, regressing the crowd density estimation map and the total count, optimizing the model parameters, and saving the model parameters with minimum loss; at prediction time, loading the saved minimum-loss model parameters and directly obtaining the crowd density estimation map and the crowd counting result as the prediction results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211084706.1A CN115457464B (en) | 2022-09-06 | 2022-09-06 | Crowd counting method based on transformer and CNN |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115457464A CN115457464A (en) | 2022-12-09 |
CN115457464B true CN115457464B (en) | 2023-11-10 |
Family
ID=84302181
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211084706.1A Active CN115457464B (en) | 2022-09-06 | 2022-09-06 | Crowd counting method based on transformer and CNN |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115457464B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115861930B (en) * | 2022-12-13 | 2024-02-06 | 南京信息工程大学 | Crowd counting network modeling method based on hierarchical difference feature aggregation |
CN117952869B (en) * | 2024-03-27 | 2024-06-18 | 西南石油大学 | Drilling fluid rock debris counting method based on weak light image enhancement |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109271960A (en) * | 2018-10-08 | 2019-01-25 | 燕山大学 | A kind of demographic method based on convolutional neural networks |
CN113537393A (en) * | 2021-08-09 | 2021-10-22 | 南通大学 | Dark scene three-dimensional human body posture estimation algorithm based on improved Transformer |
CN114821357A (en) * | 2022-04-24 | 2022-07-29 | 中国人民解放军空军工程大学 | Optical remote sensing target detection method based on transformer |
Non-Patent Citations (5)
Title |
---|
CCTrans: Simplifying and Improving Crowd Counting with Transformer; Tian Ye et al.; arXiv; pp. 1-11 *
CrowdFormer: An Overlap Patching Vision Transformer for Top-Down Crowd Counting; Yang Shangpeng et al.; Proceedings of the International Joint Conference on Artificial Intelligence; pp. 1545-1551 *
CrowdFormer: Weakly-Supervised Crowd Counting with Improved Generalizability; Siddharth Singh Savner et al.; arXiv; pp. 1-6 *
PVT v2: Improved Baselines with Pyramid Vision Transformer; Wang Wenhai et al.; arXiv; pp. 1-8 *
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers; Zheng Sixiao et al.; IEEE/CVF Conference on Computer Vision and Pattern Recognition; pp. 6877-6886 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115457464B (en) | Crowd counting method based on transformer and CNN | |
CN111611878B (en) | Method for crowd counting and future people flow prediction based on video image | |
CN110348376B (en) | Pedestrian real-time detection method based on neural network | |
CN112183258A (en) | Remote sensing image road segmentation method based on context information and attention mechanism | |
CN115049936A (en) | High-resolution remote sensing image-oriented boundary enhancement type semantic segmentation method | |
CN109800629A (en) | A kind of Remote Sensing Target detection method based on convolutional neural networks | |
US20220358765A1 (en) | Method for extracting oil storage tank based on high-spatial-resolution remote sensing image | |
CN111582029A (en) | Traffic sign identification method based on dense connection and attention mechanism | |
CN110399820B (en) | Visual recognition analysis method for roadside scene of highway | |
CN114638836B (en) | Urban street view segmentation method based on highly effective driving and multi-level feature fusion | |
CN115641473A (en) | Remote sensing image classification method based on CNN-self-attention mechanism hybrid architecture | |
CN114022408A (en) | Remote sensing image cloud detection method based on multi-scale convolution neural network | |
CN115311194A (en) | Automatic CT liver image segmentation method based on transformer and SE block | |
CN110097028A (en) | Crowd's accident detection method of network is generated based on three-dimensional pyramid diagram picture | |
CN114360067A (en) | Dynamic gesture recognition method based on deep learning | |
CN113313031B (en) | Deep learning-based lane line detection and vehicle transverse positioning method | |
CN109903373A (en) | A kind of high quality human face generating method based on multiple dimensioned residual error network | |
CN110599502A (en) | Skin lesion segmentation method based on deep learning | |
CN117876824B (en) | Multi-modal crowd counting model training method, system, storage medium and equipment | |
CN115841625A (en) | Remote sensing building image extraction method based on improved U-Net model | |
CN115100165B (en) | Colorectal cancer T-staging method and system based on CT image of tumor area | |
CN113139489A (en) | Crowd counting method and system based on background extraction and multi-scale fusion network | |
CN118134952A (en) | Medical image segmentation method based on feature interaction | |
CN111222453A (en) | Remote sensing image change detection method based on dense connection and geometric structure constraint | |
CN113673478B (en) | Port large-scale equipment detection and identification method based on deep learning panoramic stitching |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |