CN116740439A - Crowd counting method based on cross-scale pyramid Transformer - Google Patents

Crowd counting method based on cross-scale pyramid Transformer

Info

Publication number
CN116740439A
Authority
CN
China
Prior art keywords
output
convolution
crowd
feature
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310700559.4A
Other languages
Chinese (zh)
Inventor
Lei Tao
Zhang Shaole
Wang Yingbo
Xue Mingyuan
He Xi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi University of Science and Technology
Original Assignee
Shaanxi University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi University of Science and Technology filed Critical Shaanxi University of Science and Technology
Priority to CN202310700559.4A priority Critical patent/CN116740439A/en
Publication of CN116740439A publication Critical patent/CN116740439A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4038Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a crowd counting method based on a cross-scale pyramid Transformer, which can effectively capture the local and global information of crowd images, enhances the capture of crowd targets at different scales, and effectively alleviates the low counting accuracy on dense-crowd images caused by complex backgrounds and large variations in target scale. First, a crowd image sample is obtained and divided into fixed-size image blocks by Patch Embedding, and the processed image blocks are input into a pyramid Transformer to obtain three feature maps of different resolutions. The feature maps are then up-sampled to the same resolution and fed into a multi-scale dilated convolution module, whose outputs are superimposed to generate a multi-scale aggregated feature map. The aggregated feature map is then passed through pyramid average pooling to output a density map. Finally, the network is trained and optimized with a deep supervision method, and prediction is performed by regression.

Description

Crowd counting method based on cross-scale pyramid Transformer
Technical Field
The application belongs to the fields of image processing and computer vision, and particularly relates to a crowd counting method based on a cross-scale pyramid Transformer.
Background
With continuing urbanization and population growth, the crowd density in urban public places keeps increasing, and the management and safety of densely populated places have become a primary challenge for city administrators and public-safety institutions. Crowd counting has been developed to address these problems.
Early crowd counting methods typically used detection-based approaches such as sliding-window and head detection. Although simple and feasible, these methods are often affected by high density, overlapping targets and background interference, which makes the counting results inaccurate. To address these problems, regression-based methods were proposed, such as the cost-sensitive sparse linear regression model in the patent "A robust crowd counting method based on cost-sensitive sparse linear regression". The main idea of such methods is to learn a mapping from features to the number of people; by establishing a regression model between the crowd count and the crowd density map, an overall counting result can be obtained easily. Although regression-based methods improve the overall counting performance, they ignore spatial information in the image, produce only a single count, and lack reliability and interpretability.
With the development of computer vision technology, crowd counting methods based on deep learning have become mainstream. Among them, density-map-based methods are currently among the most popular and effective: the input image is mapped onto a density map, and the number of people is estimated by regressing over the pixels of the density map. The patent "A deep-learning-based high-density crowd counting method for images" counts crowds with a convolutional neural network, without manual intervention or a hand-designed feature extraction method. However, in densely crowded regions such methods cannot distinguish individual features finely, and for images with a wide range of crowd densities it is difficult to extract local features at small scales. To address this issue, attention mechanisms have been introduced into crowd counting. The essence of attention is to move from attending to everything to focusing on key regions, and it offers few parameters, high speed and good performance. Introducing attention into crowd counting alleviates problems such as occlusion and uneven crowd distribution. However, attention-based methods still cannot capture global information effectively.
The Transformer is a neural network based on a self-attention mechanism and can model global dependencies within an input sequence. A Transformer-based crowd counting model, TransCrowd, was the first Transformer-based crowd counting study; it reformulates the crowd counting problem from the perspective of sequence counting, and experimental results show that the model achieves very good performance on dense crowd counting tasks. Although the Transformer has strong advantages in extracting context information, it is still insufficient at capturing local information, so combining local and global information to improve model performance remains an important research direction.
Disclosure of Invention
In order to solve the problems in the prior art, the application provides a crowd counting method based on a cross-scale pyramid Transformer, which can effectively capture the local and global information of crowd images, enhance the capture of crowd targets at different scales, and effectively alleviate the low counting accuracy on dense-crowd images caused by complex backgrounds and large variations in target scale.
In order to achieve the above object, the technical scheme adopted by the application comprises the following steps: first, training samples are acquired, the sample images are annotated at pixel level, and the annotations are smoothed into ground-truth density maps with a Gaussian kernel; then, the cropped training sample images are input into a Patch Embedding module for preprocessing to form one-dimensional vectors; next, the processed one-dimensional vectors are input into a Transformer backbone network to obtain three branch feature maps of different resolutions, which are up-sampled to the same size and dimension; the three branch feature maps are then input into a multi-scale dilated convolution module, and the enhanced feature maps are added pixel by pixel on the corresponding channels; the output three branch feature maps are used for auxiliary training with a mean square error loss; the output aggregated feature map is then input into a pyramid average pooling module for feature map processing, and the output feature map is regressed by convolution to obtain a predicted density map; finally, training and optimization are performed with a weighted sum of the counting loss, the optimal transport loss and the mean square error loss, and the crowd density prediction map and crowd counting result are predicted.
Further, the training sample acquisition is to acquire RGB images at different times, at different places and with different head scales from different scenes; the pixel-level annotation is to label the samples: a pixel is marked on the head of each person in the image, one pixel representing one pedestrian, so that the sum of the annotated pixels is the total number of people; the Gaussian smoothing is to smooth the pixel-annotated image with a Gaussian kernel to generate the density map used for loss training, and the Gaussian kernel function for generating the density map is as follows:
F(x) = Σ_{i=1}^{N} δ(x − x_i) * G_{σ_i}(x),  with  σ_i = β·d̄_i  and  d̄_i = (1/m) Σ_{j=1}^{m} d_i^j
wherein δ is the impulse function, i and j index the pixel points, x_i is the position of the i-th head, δ(x − x_i) is the impulse response at the head position in the image, G_{σ_i} is the Gaussian kernel, N is the total number of heads in the image, and d̄_i is the average distance from x_i to its m nearest heads, which in a dense crowd is approximately equal to the head size; β = 0.3 is taken.
Further, the training sample image cropping is to crop the image to 224×224 and input it into the Patch Embedding module; the input image is passed through a convolution with kernel size 16×16, stride 16 and 768 convolution kernels, changing it from [224, 224, 3] to [14, 14, 768], and the height and width dimensions are then flattened to give a one-dimensional vector [196, 768].
Further, the one-dimensional vector is input into the Transformer backbone network for feature extraction to obtain feature maps whose resolutions are 1/8, 1/16 and 1/32 of the input image resolution;
the Transformer backbone network comprises a pyramid Transformer formed by four stages, each stage comprising an ordinary convolution layer, a max pooling layer and a Transformer encoder module;
first, the one-dimensional vector is input to the encoder in the Transformer backbone network; the convolution kernel size of each stage is 7×7 with stride 2, the pooling kernel size of the pooling layer is 3×3 with stride 2, the numbers of output channels of the four stages are 128, 256, 512 and 1024, and the output feature map sizes of the four stages are respectively 1/4, 1/8, 1/16 and 1/32 of the original image size;
then, in the Transformer encoder module, self-attention is computed on the input over a number of blocks; the first block comprises a LayerNorm layer, a local multi-head self-attention layer, a LayerNorm layer and an MLP layer; the second block comprises a LayerNorm layer, a global multi-head self-attention layer, a LayerNorm layer and an MLP layer; the layers are connected by skip connections; the numbers of blocks in the four stages are 2, 2, 18 and 2 respectively, and the numbers of multi-head self-attention heads per stage are 4, 8, 16 and 32 respectively.
Further, the up-sampling is to up-sample the obtained feature maps with resolutions of 1/16 and 1/32 to the 1/8 size by bilinear interpolation and to change the feature map channels to 128 with 1×1 convolutions, so as to obtain three feature maps of the same resolution and dimension, where the bilinear interpolation is computed as:
f(x, y) ≈ (1 − a)(1 − b)·f(x_1, y_1) + a(1 − b)·f(x_2, y_2) + ab·f(x_3, y_3) + (1 − a)b·f(x_4, y_4)
wherein a = x − x_1 and b = y − y_1; f(x, y) is the target pixel value to be computed; for each pixel (x, y) in the new image, the four neighbouring pixels (x_1, y_1), (x_2, y_2), (x_3, y_3) and (x_4, y_4) are the four pixels of the original image closest to (x, y) along the x and y directions; the distance weights of each neighbouring pixel in the x and y directions are computed, and the values of the four neighbourhood pixels are weighted and averaged according to these weights.
Further, inputting the three branch feature maps into the multi-scale dilated convolution module means that the three obtained 1/8-resolution feature maps are input into the multi-scale dilated convolution module to obtain multi-scale feature maps carrying different information; the multi-scale dilated convolution module comprises four small branches: the first branch consists of a 1×1 convolution kernel with 384 output channels; the second branch consists of a 1×1 convolution kernel followed by a 3×3 convolution kernel with dilation rate 1, with 128 output channels; the third branch consists of a 3×3 convolution kernel followed by a 3×3 convolution kernel with dilation rate 2, with 128 output channels; the fourth branch consists of a 5×5 convolution kernel followed by a 3×3 convolution kernel with dilation rate 3, with 128 output channels; the outputs of the second, third and fourth branches are concatenated and then added to the first branch to output the enhanced feature map; the final output feature map has size 56×56 and 384 channels;
the pixel-by-pixel addition is to take the output three branch enhanced feature maps, which have the same resolution and number of channels, and add the three multi-scale feature maps pixel by pixel on the corresponding channels; each branch feature map has size 56×56 with 384 channels, and the summed output feature map also has size 56×56 with 384 channels.
Further, the mean-square-error auxiliary training is to perform auxiliary regression between the output three branch feature maps and the ground-truth density map to optimize the model parameters, with the weight coefficients of the auxiliary loss functions at different depths set to 0.1, 0.2 and 0.3 respectively;
the mean square error loss function measures the distance between the predicted values and the true values, and the smaller the value, the better the model fits; it is computed as:
L_2 = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2
where y_i denotes the true value of a sample, ŷ_i denotes the predicted value of the model and n denotes the number of samples; the formula divides the sum of the squared prediction errors over all data points by the number of samples n.
Further, the feature map processing is to input the output aggregated feature map into the pyramid average pooling module, which is a four-level module with pooling kernel sizes of 2×2, 3×3, 5×5 and 7×7; each branch comprises a 1×1 convolution kernel and up-sampling so that the outputs keep the same resolution and dimension; the outputs of the four branch feature maps are concatenated with the output aggregated feature map to give the final regression feature map, of size 56×56 with 384 channels.
Further, the convolution regression is that the output regression feature map is smoothed and reduced in dimension by a 1×1 convolution and regressed to obtain the predicted density map; the convolution layer is followed by batch normalization and ReLU activation, the crowd density prediction map is finally output, and the crowd counting result is obtained by summing the predicted density.
Further, the loss function is constructed as a weighted sum of the counting loss, the optimal transport loss and the mean square error loss, computed as:
L(Z, Ẑ) = |‖Z‖_1 − ‖Ẑ‖_1| + λ_1·L_OT(Z, Ẑ) + λ_2·L_2(Z, Ẑ)
wherein Z and Ẑ denote the predicted density map and the ground truth, ‖·‖_1 denotes the L1 norm, L_OT denotes the optimal transport loss, L_2 denotes the mean square error loss, and λ_1 and λ_2 are loss coefficients set to 0.01 and 0.1 respectively;
the counting loss is defined as the absolute difference between the true count and the predicted count, computed with the L1 norm; the mean square error loss is a loss function used to compute the difference between the model's predicted values and the actual values, squaring the error so that a positive value representing the magnitude of the model's prediction error is obtained; the optimal transport loss is computed as:
OT(p, q) = min_{γ ∈ Γ(p, q)} ∫ c(x, y) dγ(x, y)
where p and q are two probability distributions, c(x, y) is a cost function defined on (x, y), and Γ(p, q) is the set of joint distributions with marginals p and q; minimizing the above expression finds the best mapping between p and q.
Compared with the prior art, the crowd counting method based on the cross-scale pyramid Transformer adopts a novel Transformer network structure which uses a depth-separable self-attention module to extract local and global feature information, strengthening the feature representation ability of the network and addressing complex backgrounds in crowd scenes. To address the low utilization of image features caused by large target-scale variation and occlusion, and the different semantic information carried by features at different scales, a feature pyramid module is designed to extract both shallow and deep features. This efficient feature pyramid module extracts feature maps at different resolutions and thus obtains richer multi-scale features: deep features contain rich semantic information, while shallow features accurately preserve object position information, and the feature pyramid module makes full use of the semantic information from deep layers and the detail information from shallow layers, improving the accuracy of small-target detection. The multi-scale receptive-field regression head strengthens feature capture through multi-scale dilated convolution and pyramid average pooling, and can effectively fuse features of different scales to predict the density map. The multi-scale dilated convolution module contains multi-branch dilated convolutions; by stacking dilated convolution layers with different dilation rates in parallel, it enlarges the receptive field, mines image detail information and enhances feature extraction, making it suitable for crowd density estimation tasks. The pyramid average pooling module is designed to overcome the influence of head-scale variation, reduce the information loss of the decoding part, accommodate input images of different scales, and extract features at every spatial position without losing information. During training, deep supervision is used and an additional supervision loss is added at each intermediate stage of the network, so that the network converges better. The application achieves higher-accuracy crowd counting and addresses the low counting accuracy of mainstream networks on dense-crowd images with complex backgrounds and large target-scale variation.
Drawings
FIG. 1 is a schematic flow diagram of the present application;
FIG. 2 is a block diagram of the cross-scale pyramid Transformer network of the present application;
FIG. 3 is a diagram of the Transformer depth-separable convolution self-attention structure;
FIG. 4 is a diagram of the multi-scale receptive-field regression head of the application;
FIG. 5 shows partial visualization results of different methods on the NWPU dataset.
Detailed Description
The present application will be further illustrated by the following description of the drawings and specific embodiments, wherein it is apparent that the embodiments described are some, but not all, of the embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The application provides a crowd counting method based on a cross-scale pyramid Transformer, involving deep learning, convolutional neural networks, Transformer networks, deep supervision and related techniques. First, in the feature extraction stage, a pyramid Transformer backbone network structure based on depth-separable self-attention is designed; this network structure can effectively capture the local and global information of images, thereby effectively alleviating the low counting accuracy caused by complex backgrounds in crowd density images. Then, a feature pyramid fusion module is designed to achieve efficient fusion of the shallow detail features and deep semantic features of dense-crowd images. Finally, a multi-scale receptive-field regression head strengthens the network's ability to capture targets of different scales. The method can be applied to related tasks such as dense crowd counting, better addresses the low counting accuracy of existing crowd counting methods in dense scenes with complex backgrounds and large target-scale variation, and provides a new research idea and technical means for automatically counting people in dense crowd scenes.
Referring to fig. 1, an embodiment of the present application includes the steps of:
(1) Acquire training samples, annotate the head position of each person in every image with a pixel, and apply Gaussian smoothing with a Gaussian kernel to generate the corresponding ground-truth density maps.
(2) Crop the training samples obtained in step (1) to 224×224 and input them into the Patch Embedding module for preprocessing, obtaining one-dimensional vectors for training.
(3) Input the one-dimensional vectors obtained in step (2) into the pyramid Transformer of the Transformer backbone network for feature extraction, obtaining feature maps whose resolutions are 1/8, 1/16 and 1/32 of the input image resolution.
(4) Up-sample the feature maps with resolutions of 1/16 and 1/32 obtained in step (3) to the 1/8 size, obtaining three feature maps of the same resolution.
(5) Input the three 1/8-resolution feature maps obtained in step (4) into the multi-scale dilated convolution module of the network, obtaining multi-scale feature maps carrying different information, i.e. the three branch feature maps.
(6) Perform auxiliary training on the three branch feature maps output in step (5) against the ground-truth density map using a mean square error loss.
(7) Add the three branch feature maps of the same resolution and number of channels output in step (5) pixel by pixel on the corresponding channels to obtain the aggregated feature map.
(8) Input the aggregated feature map output in step (7) into the pyramid average pooling module to obtain the regression feature map.
(9) Smoothly reduce the dimension of the regression feature map output in step (8) and regress the predicted density map; summing the predicted density gives the crowd counting result.
(10) Evaluate the crowd density prediction map and crowd counting result using the counting loss, the optimal transport loss and the mean square error loss functions.
Specifically, referring to fig. 2, the method of the present application comprises the steps of:
(1) First, the required training samples are acquired: RGB images at different times, in different places and with different head scales are collected from different scenes. Second, the samples are annotated to produce labels: a pixel is marked on the head of each person in the image, one pixel representing one pedestrian, so the sum of the annotated pixels is the total number of people. Finally, a Gaussian kernel is used for smoothing to generate the density maps used for loss training. The Gaussian kernel function for generating the density map is as follows:
F(x) = Σ_{i=1}^{N} δ(x − x_i) * G_{σ_i}(x),  with  σ_i = β·d̄_i  and  d̄_i = (1/m) Σ_{j=1}^{m} d_i^j
wherein δ is the impulse function, i and j index the pixel points, x_i is the position of the i-th head, δ(x − x_i) is the impulse response at the head position in the image, G_{σ_i} is the Gaussian kernel, and N is the total number of heads in the image. d̄_i is the average distance from x_i to its m nearest heads and, in the case of a dense crowd, is approximately equal to the head size; empirically, β = 0.3 gives the best results.
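For reference, a minimal NumPy/SciPy sketch of this ground-truth density map generation with geometry-adaptive Gaussian kernels is given below; the function name, the k-nearest-neighbour search and the fallback kernel width for a single annotation are illustrative assumptions rather than details taken from the patent.

    import numpy as np
    from scipy.ndimage import gaussian_filter
    from scipy.spatial import KDTree

    def generate_density_map(points, height, width, beta=0.3, m=3):
        """Smooth head annotations into a density map with adaptive Gaussian kernels.

        points: (N, 2) array of (x, y) head coordinates; the resulting map sums to N.
        """
        density = np.zeros((height, width), dtype=np.float32)
        if len(points) == 0:
            return density
        tree = KDTree(points)
        # distances to the m nearest neighbours (first column is the point itself)
        dists, _ = tree.query(points, k=min(m + 1, len(points)))
        for i, (x, y) in enumerate(points):
            impulse = np.zeros((height, width), dtype=np.float32)
            impulse[min(int(y), height - 1), min(int(x), width - 1)] = 1.0
            if len(points) > 1:
                sigma = beta * float(np.mean(dists[i][1:]))  # sigma_i = beta * mean neighbour distance
            else:
                sigma = 15.0  # assumed fallback width for a single annotation
            density += gaussian_filter(impulse, sigma, mode='constant')
        return density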
(2) The training samples obtained in step (1) are cropped to 224×224 and input into the Patch Embedding module for preprocessing. A 224×224 picture is first divided into 196 blocks of size 16×16, and this two-dimensional sequence is flattened into a one-dimensional vector for training.
Further, the training sample is input into the Patch Embedding module; the input image is passed through a convolution with kernel size 16×16, stride 16 and 768 convolution kernels, changing it from [224, 224, 3] to [14, 14, 768], and the height and width dimensions are then flattened to give a one-dimensional vector [196, 768].
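A minimal PyTorch sketch of this Patch Embedding step follows; the class name and argument defaults are illustrative.

    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        """Split a 224x224x3 image into 16x16 patches and flatten to a [196, 768] sequence."""
        def __init__(self, in_channels=3, embed_dim=768, patch_size=16):
            super().__init__()
            # one 16x16 convolution with stride 16 acts as the non-overlapping patch projection
            self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

        def forward(self, x):                      # x: [B, 3, 224, 224]
            x = self.proj(x)                       # -> [B, 768, 14, 14]
            x = x.flatten(2).transpose(1, 2)       # -> [B, 196, 768]
            return x

    # e.g. PatchEmbedding()(torch.randn(1, 3, 224, 224)).shape == torch.Size([1, 196, 768])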
(3) The one-dimensional vector obtained in step (2) is fed through the backbone network of the model to obtain feature maps of different resolutions. The backbone architecture is a pyramid Transformer formed by four stages, each stage comprising an ordinary convolution layer, a max pooling layer and a Transformer encoder module.
First, the one-dimensional vector is input to the encoder; the convolution kernel size of each stage is 7×7 with stride 2, and the pooling kernel size of the pooling layer is 3×3 with stride 2. The numbers of output channels of the four stages are 128, 256, 512 and 1024, and the output feature map sizes of the four stages are 1/4, 1/8, 1/16 and 1/32 of the original image.
Second, in the Transformer encoder module, self-attention is computed on the input over a number of blocks; the first block comprises a LayerNorm layer, a local multi-head self-attention layer, a LayerNorm layer and an MLP layer; the second block comprises a LayerNorm layer, a global multi-head self-attention layer, a LayerNorm layer and an MLP layer; each layer is connected by skip connections. The numbers of blocks in the four stages are 2, 2, 18 and 2 respectively, and the numbers of multi-head self-attention heads per stage are 4, 8, 16 and 32 respectively.
Further, the Transformer depth-separable convolution self-attention structure is shown in FIG. 3. The output Z_{l−1} of the previous layer is first reshaped into a two-dimensional feature map; each feature map is then spatially grouped within local windows, locally-grouped self-attention (L-MSA) is computed, and global sub-sampled self-attention (G-MSA) is computed. Specifically, in the L-MSA stage, the feature map is divided into m×n sub-windows and self-attention is computed only inside each sub-window, with no interaction between sub-windows, thereby obtaining local features. In the G-MSA stage, a low-dimensional feature is extracted from each sub-window as the representation of that window, and the windows then interact on the basis of these representations, thereby capturing global features. The network structure also applies the necessary layer normalization (LayerNorm), multi-layer perceptron (MLP) and residual connections, as shown in the Transformer module in FIG. 3.
The entire Transformer block can be expressed as:
Z_l' = L-MSA(LN(Z_{l−1})) + Z_{l−1}
Z_l'' = MLP(LN(Z_l')) + Z_l'
Z_l''' = G-MSA(LN(Z_l'')) + Z_l''
Z_l = MLP(LN(Z_l''')) + Z_l'''
wherein Z_{l−1}, Z_l', Z_l'', Z_l''' and Z_l denote the outputs of the successive layers, L-MSA denotes locally-grouped self-attention, G-MSA denotes global sub-sampled self-attention, LN is layer normalization, and the MLP consists of a Linear fully connected layer, a tanh activation function and a Linear fully connected layer.
The self-attention in the Transformer is computed as:
Attention(Q, K, V) = softmax(QK^T / √d_k)·V
wherein Q denotes the query, K denotes the key, and V denotes the information extracted from the one-dimensional vector. Matching Q against K computes their correlation: the greater the correlation, the greater the weight given to the corresponding V. d_k is the dimension of the input vectors in the multi-head self-attention computation.
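The scaled dot-product self-attention above can be sketched, for example, as follows (a generic illustration rather than the exact implementation of the application):

    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(q, k, v):
        """q, k, v: [B, heads, seq_len, d_k]; returns the attention-weighted values, same shape."""
        d_k = q.size(-1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5  # QK^T / sqrt(d_k)
        weights = F.softmax(scores, dim=-1)                          # attention weights sum to 1
        return torch.matmul(weights, v)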
The Softmax is computed as:
softmax(z_i) = e^{z_i} / Σ_{c=1}^{C} e^{z_c}
wherein z_i is the output value of the i-th node and C is the number of output nodes, i.e. the number of classes. The Softmax function converts the multi-class output values into a probability distribution over the range [0, 1] that sums to 1.
Finally, in the Transformer encoder module, layer normalization is applied to each block after the self-attention computation, preserving the important information of its features. Specifically, layer normalization is computed as:
LN(x) = γ ⊙ (x − μ) / √(σ² + ε) + β
where x denotes the input feature, μ denotes the mean of the feature, σ² denotes the variance of the feature, ε is a small positive value (e.g. 10^{-6}), γ is a scaling parameter, β is a shift parameter, and ⊙ denotes element-wise multiplication.
After the Transformer depth-separable convolution self-attention, the output shape is kept the same as the input shape: an input of [197, 768] yields an output of [197, 768]. After layer normalization, the final result is obtained through a multi-layer perceptron (MLP) consisting of a Linear fully connected layer, a tanh activation function and a Linear fully connected layer. The Linear fully connected layer is one of the basic parameterized layers of a neural network, computed as:
y = Wx + b (6)
where W is the weight matrix, x is the input feature vector, b is the bias vector, and y is the output feature vector. Both W and b are parameters to be learned.
The tanh activation function is computed as:
tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}) (7)
wherein x is the input real number. Its output range is (−1, 1): the larger the input, the closer the output is to 1; the smaller the input, the closer the output is to −1. When the input is 0, the output of the tanh function is 0.
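The MLP sub-layer (Linear fully connected layer + tanh + Linear fully connected layer) can be sketched as below; the hidden dimension of 3072 and the class name are illustrative assumptions.

    import torch.nn as nn

    class TransformerMLP(nn.Module):
        """Two fully connected layers with a tanh activation, as described for the MLP sub-layer."""
        def __init__(self, dim=768, hidden_dim=3072):
            super().__init__()
            self.fc1 = nn.Linear(dim, hidden_dim)
            self.act = nn.Tanh()
            self.fc2 = nn.Linear(hidden_dim, dim)

        def forward(self, x):
            return self.fc2(self.act(self.fc1(x)))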
(4) The feature maps with resolutions of 1/16 and 1/32 obtained in step (3) are up-sampled to the 1/8 size by bilinear interpolation, and all channels are changed to 128 by 1×1 convolutions, yielding three feature maps of the same resolution and dimension. Bilinear interpolation is computed as:
f(x, y) ≈ (1 − a)(1 − b)·f(x_1, y_1) + a(1 − b)·f(x_2, y_2) + ab·f(x_3, y_3) + (1 − a)b·f(x_4, y_4) (8)
wherein a = x − x_1, b = y − y_1, and f(x, y) is the target pixel value to be computed. For each pixel (x, y) in the new image, the four neighbouring pixels (x_1, y_1), (x_2, y_2), (x_3, y_3) and (x_4, y_4) are the four pixels of the original image closest to (x, y) along the x and y directions. The distance weights of each neighbouring pixel in the x and y directions are computed, and the values of the four neighbourhood pixels are weighted and averaged according to these weights.
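A possible PyTorch sketch of this branch alignment step follows; the input channel numbers (256, 512 and 1024 for the 1/8, 1/16 and 1/32 branches, matching the backbone stage widths) and all names are assumptions for illustration.

    import torch.nn as nn
    import torch.nn.functional as F

    class BranchAlignment(nn.Module):
        """Bring the 1/8, 1/16 and 1/32 feature maps to the same resolution (1/8) and 128 channels."""
        def __init__(self, channels=(256, 512, 1024), out_channels=128):
            super().__init__()
            self.reduces = nn.ModuleList(
                nn.Conv2d(c, out_channels, kernel_size=1) for c in channels)

        def forward(self, feats):                       # feats: [1/8, 1/16, 1/32] feature maps
            size = feats[0].shape[-2:]
            out = []
            for feat, reduce in zip(feats, self.reduces):
                if feat.shape[-2:] != size:
                    # bilinear up-sampling to the 1/8 resolution
                    feat = F.interpolate(feat, size=size, mode='bilinear', align_corners=False)
                out.append(reduce(feat))                # 1x1 convolution to 128 channels
            return out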
(5) The three 1/8-resolution feature maps obtained in step (4) are input into the multi-scale dilated convolution module (MAC) of the multi-scale receptive-field regression head shown in FIG. 4 to obtain multi-scale feature maps carrying different information. The multi-scale dilated convolution module comprises four small branches: the first branch consists of a 1×1 convolution kernel with 384 output channels; the second branch consists of a 1×1 convolution kernel followed by a 3×3 convolution kernel with dilation rate 1, with 128 output channels; the third branch consists of a 3×3 convolution kernel followed by a 3×3 convolution kernel with dilation rate 2, with 128 output channels; the fourth branch consists of a 5×5 convolution kernel followed by a 3×3 convolution kernel with dilation rate 3, with 128 output channels; the outputs of the second, third and fourth branches are concatenated and then added to the first branch to output the enhanced feature map, whose final size is 56×56 with 384 channels.
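A possible PyTorch sketch of this multi-scale dilated convolution module, following the branch layout described above, is given below; the padding values that keep the 56×56 spatial size, the 128 input channels and all names are illustrative assumptions, and the batch normalization and ReLU described in the next paragraphs are omitted for brevity.

    import torch
    import torch.nn as nn

    class MultiScaleDilatedConv(nn.Module):
        """Four parallel branches; branches 2-4 are concatenated and added to branch 1."""
        def __init__(self, in_channels=128):
            super().__init__()
            self.branch1 = nn.Conv2d(in_channels, 384, kernel_size=1)
            self.branch2 = nn.Sequential(
                nn.Conv2d(in_channels, 128, kernel_size=1),
                nn.Conv2d(128, 128, kernel_size=3, padding=1, dilation=1))
            self.branch3 = nn.Sequential(
                nn.Conv2d(in_channels, 128, kernel_size=3, padding=1),
                nn.Conv2d(128, 128, kernel_size=3, padding=2, dilation=2))
            self.branch4 = nn.Sequential(
                nn.Conv2d(in_channels, 128, kernel_size=5, padding=2),
                nn.Conv2d(128, 128, kernel_size=3, padding=3, dilation=3))

        def forward(self, x):                                     # x: [B, 128, 56, 56]
            concat = torch.cat([self.branch2(x), self.branch3(x), self.branch4(x)], dim=1)
            return self.branch1(x) + concat                       # [B, 384, 56, 56]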
Furthermore, a batch normalization layer is used after each convolution to alleviate the internal covariate shift caused by network updates, improving the training speed and stability of the network. It also alleviates the vanishing-gradient problem and reduces the risk of overfitting. The batch normalization layer is computed as:
y = γ·(x − E[x]) / √(Var[x] + ε) + β (9)
wherein x denotes the input feature, E[x] denotes the mean of the input features, Var[x] denotes the variance, ε is a small positive value (e.g. 10^{-5}), and γ and β are the scaling and offset parameters respectively.
After the batch normalization layer, the network enters a ReLU activation function layer, whose main advantages are linearity and non-saturation, making the network easier to optimize. In addition, the derivative of the ReLU function is 1 for x ≥ 0 and 0 for x < 0, so gradient computation in back-propagation is more efficient and the vanishing-gradient problem is avoided. It is computed as:
f(x) = max(0, x) (10)
wherein x is the input real number; when the input x > 0, the output of the ReLU function is f(x) = x; when x ≤ 0, the output of the ReLU function is f(x) = 0.
(6) Auxiliary training is performed with a mean square error loss (L2 loss): auxiliary regression is performed between the three branch feature maps output in step (5) and the ground-truth density map to optimize the model parameters, with the weight coefficients of the auxiliary loss functions at different depths set to 0.1, 0.2 and 0.3 respectively.
The mean square error loss function measures the distance between the predicted values and the true values; the smaller the value, the better the model fits. It is computed as:
L_2 = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 (11)
where y_i denotes the true value of a sample, ŷ_i denotes the predicted value of the model, and n denotes the number of samples; the formula divides the sum of the squared prediction errors over all data points by the number of samples n.
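The deeply supervised auxiliary loss over the three branches can be sketched as follows; it assumes the branch feature maps have already been regressed to single-channel density maps, and the function name is illustrative.

    import torch.nn.functional as F

    def auxiliary_loss(branch_density_maps, gt_density, weights=(0.1, 0.2, 0.3)):
        """Weighted mean-square-error auxiliary losses for the three branch predictions."""
        return sum(w * F.mse_loss(pred, gt_density)
                   for w, pred in zip(weights, branch_density_maps))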
(7) The three branch enhanced feature maps output in step (5), which have the same resolution and number of channels, are added pixel by pixel on the corresponding channels as three multi-scale feature maps. Specifically, each branch feature map has size 56×56 with 384 channels, and the summed output feature map also has size 56×56 with 384 channels.
(8) The aggregated feature map output in step (7) is input into the pyramid average pooling module (PAP) of the multi-scale receptive-field regression head. The pyramid average pooling module is a four-level module with pooling kernel sizes of 2×2, 3×3, 5×5 and 7×7; each branch comprises a 1×1 convolution kernel and up-sampling so that the outputs keep the same resolution and dimension; the outputs of the four branch feature maps are concatenated with the aggregated feature map output in step (7), and the final regression feature map is output, with size 56×56 and 384 channels.
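A possible PyTorch sketch of the pyramid average pooling module follows. It assumes the 2×2/3×3/5×5/7×7 kernels are average-pooling kernels, that each branch is projected to 96 channels before up-sampling, and that a final 1×1 convolution restores the 384-channel regression feature map; these details and all names are illustrative assumptions where the text is ambiguous.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PyramidAveragePooling(nn.Module):
        """Four average-pooling levels (2x2, 3x3, 5x5, 7x7), each followed by a 1x1
        convolution and bilinear up-sampling, concatenated with the input feature map."""
        def __init__(self, in_channels=384, branch_channels=96, kernel_sizes=(2, 3, 5, 7)):
            super().__init__()
            self.branches = nn.ModuleList(
                nn.Sequential(nn.AvgPool2d(kernel_size=k, stride=k),
                              nn.Conv2d(in_channels, branch_channels, kernel_size=1))
                for k in kernel_sizes)
            # project the concatenation (input + four branches) back to 384 channels
            self.fuse = nn.Conv2d(in_channels + branch_channels * len(kernel_sizes),
                                  in_channels, kernel_size=1)

        def forward(self, x):                                     # x: [B, 384, 56, 56]
            size = x.shape[-2:]
            feats = [x]
            for branch in self.branches:
                feats.append(F.interpolate(branch(x), size=size,
                                           mode='bilinear', align_corners=False))
            return self.fuse(torch.cat(feats, dim=1))             # [B, 384, 56, 56]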
(9) The regression feature map output in step (8) is smoothed and reduced in dimension by a 1×1 convolution and regressed to obtain the predicted density map, with batch normalization and ReLU activation following the convolution layer. Finally, the crowd density prediction map is output, and summing the predicted density gives the crowd counting result.
(10) The crowd density prediction map and crowd counting result are evaluated using the counting loss, the optimal transport loss and the mean square error loss functions. The model is trained and optimized by minimizing the weighted sum of these three loss functions. After training, the model parameters with the minimum loss are saved, and at prediction time the crowd density estimation map and crowd counting result are obtained directly as output.
Smoothing each annotation point with a Gaussian kernel during crowd counting compromises generalization performance, so a distribution matching method is used for crowd counting. The loss function is formulated as a weighted sum of the counting loss, the optimal transport loss and the mean square error loss. It is computed as:
L(Z, Ẑ) = |‖Z‖_1 − ‖Ẑ‖_1| + λ_1·L_OT(Z, Ẑ) + λ_2·L_2(Z, Ẑ) (12)
wherein Z and Ẑ denote the predicted density map and the ground truth, ‖·‖_1 denotes the L1 norm, L_OT denotes the optimal transport loss, L_2 denotes the mean square error loss, and λ_1 and λ_2 are loss coefficients set to 0.01 and 0.1 respectively.
The counting loss is defined as the absolute difference between the true count and the predicted count, computed with the L1 norm. The mean square error loss is the loss function of step (6), used to compute the difference (i.e. the error) between the model's predicted values and the actual values, squaring the error so that a positive value representing the magnitude of the model's prediction error is obtained. The application uses optimal transport to measure the similarity between the normalized predicted density map and the normalized ground-truth density map, and the optimal transport loss benefits the fitting ability of the model. The optimal transport loss is computed as:
OT(p, q) = min_{γ ∈ Γ(p, q)} ∫ c(x, y) dγ(x, y) (13)
where p and q are two probability distributions, c(x, y) is a cost function defined on (x, y), and Γ(p, q) is the set of joint distributions with marginals p and q. Minimizing the above expression finds the best mapping between p and q.
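A sketch of the overall training objective follows; the optimal transport term is left as a pluggable callable, since its Sinkhorn-style computation is beyond this illustration, and all names and the callable interface are assumptions.

    import torch
    import torch.nn.functional as F

    def crowd_counting_loss(pred_density, gt_density, ot_loss_fn,
                            lambda_ot=0.01, lambda_mse=0.1):
        """Count loss (absolute difference of total counts) + weighted OT loss + weighted MSE loss."""
        count_loss = torch.abs(pred_density.sum() - gt_density.sum())
        ot_loss = ot_loss_fn(pred_density, gt_density)   # optimal transport between normalized maps
        mse_loss = F.mse_loss(pred_density, gt_density)
        return count_loss + lambda_ot * ot_loss + lambda_mse * mse_loss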
The effectiveness of the crowd counting method according to the embodiment of the application is verified through experiments as follows:
(1) Experimental details and data
The application is an end-to-end training framework based on deep learning; training is carried out on a server with an NVIDIA GeForce RTX 3090 (24 GB), Python 3.7 and PyTorch 1.7. The backbone adopts a PVT-based improved model pre-trained on the ImageNet-1k dataset. Images from ST Part_B and UCF-QNRF are cropped to 512, and images from ST Part_A and NWPU are cropped to 256. The AdamW optimization algorithm is used with a training batch size of 8, and the initial learning rate is set to 1e-5.
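The optimizer configuration described above corresponds roughly to the following PyTorch setup; the weight-decay value and all names are illustrative assumptions, and the model and data loader are placeholders.

    import torch

    def build_optimizer(model, lr=1e-5, weight_decay=1e-4):
        """AdamW optimizer with the initial learning rate used in the experiments;
        the weight-decay value here is an assumption, as it is not stated in the text."""
        return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

    # training batches of size 8 would then be drawn from a DataLoader, e.g.
    # loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)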
To verify the counting effect of the application on dense crowds, experiments were performed on three datasets: ShanghaiTech, UCF-QNRF and NWPU. The ShanghaiTech dataset is divided into two parts, Part_A and Part_B. Part_A contains 482 images, with 300 training images and 182 test images. Part_B contains 716 images in total, with 400 training images and 316 test images. The UCF-QNRF dataset contains 1535 images, of which 1201 are training images and 334 are test images. The NWPU dataset contains 5109 images, with 3109 training images, 500 validation images and 1500 test images.
(2) Evaluation index
In order to further compare the performance of different counting algorithms, the application adopts the mean absolute error (MAE) and mean squared error (MSE) indices to evaluate the counting results and the generated density maps; the smaller the MAE and MSE, the better the counting performance of the model. The mean absolute error and mean squared error are computed as:
MAE = (1/N) Σ_{i=1}^{N} |Y_i − Ŷ_i| (14)
MSE = √( (1/N) Σ_{i=1}^{N} (Y_i − Ŷ_i)^2 ) (15)
where N is the number of samples in the test set, and Y_i and Ŷ_i are respectively the predicted count and the true count in the i-th test image. The predicted count is obtained by summing the crowd density map output by the model.
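The two evaluation metrics can be computed, for example, as follows; the sketch follows the common crowd-counting convention in which MSE denotes the root of the mean squared counting error, which is an assumption here.

    import numpy as np

    def counting_metrics(pred_counts, gt_counts):
        """Mean absolute error and (root) mean squared error over the test-set counts."""
        pred = np.asarray(pred_counts, dtype=np.float64)
        gt = np.asarray(gt_counts, dtype=np.float64)
        mae = np.mean(np.abs(pred - gt))
        mse = np.sqrt(np.mean((pred - gt) ** 2))
        return mae, mse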
(3) Comparison with the prior art
To verify the effectiveness of the ISPT-Net embodiment of the application, experimental verification was performed on three common datasets: ShanghaiTech, UCF-QNRF and NWPU. In addition, comparisons were made with advanced methods including CSRNet, DM-Count, SUA-Fully, MFP-Net, TransCrowd, DLMP-Net, SC2Net and FIDTM. The specific results are shown in the following table.
The embodiment of the application was compared with classical methods on the ShanghaiTech dataset. The experimental results show that, compared with traditional convolutional-neural-network models, the embodiment of the application has clear advantages. To further verify the generalization ability of the proposed model ISPT-Net, experiments were also carried out on the UCF-QNRF and NWPU datasets, which again show that the embodiment of the application outperforms the other methods. The feature pyramid module designed in the application captures more detail information, which is beneficial for detecting small objects; meanwhile, the multi-scale dilated convolution module and the pyramid average pooling module designed in the application better capture multi-scale features and the global context information from the Transformer, and regress the crowd count.
The density maps output by different methods on the NWPU-Crowd dataset are compared in FIG. 5, with the count predicted from the density map marked at the lower right corner of each picture. The first column is a negative sample whose texture information is similar to that of a dense crowd, and the second through fifth columns are scenes with different crowd densities. The visual results show that the proposed method effectively learns the mapping between crowd images and crowd density maps under different scenes and different crowd densities, and that the method is highly robust. Because the Transformer has good global context modelling ability, the depth-separable self-attention structure effectively acquires the local and global information of the image, so the application better handles the complex counting scenes and large target-scale variation faced by crowd counting.
The pyramid Transformer backbone network structure with depth-separable self-attention can effectively capture the local and global information of images; dilated convolution and average pooling strengthen the network's ability to capture targets of different scales, effectively alleviating the low counting accuracy caused by complex backgrounds in crowd density images.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the technical solutions according to the embodiments of the present application.

Claims (10)

1. A crowd counting method based on a cross-scale pyramid Transformer, characterized by comprising the following steps: first, training samples are acquired, the sample images are annotated at pixel level, and the annotations are smoothed into ground-truth density maps with a Gaussian kernel; then, the cropped training sample images are input into a Patch Embedding module for preprocessing to form one-dimensional vectors; next, the processed one-dimensional vectors are input into a Transformer backbone network to obtain three branch feature maps of different resolutions, which are up-sampled to the same size and dimension; the three branch feature maps are then input into a multi-scale dilated convolution module, and the enhanced feature maps are added pixel by pixel on the corresponding channels; the output three branch feature maps are used for auxiliary training with a mean square error loss; the output aggregated feature map is then input into a pyramid average pooling module for feature map processing, and the output feature map is regressed by convolution to obtain a predicted density map; finally, training and optimization are performed with a weighted sum of the counting loss, the optimal transport loss and the mean square error loss, and the crowd density prediction map and crowd counting result are predicted.
2. The crowd counting method based on a cross-scale pyramid Transformer according to claim 1, wherein the training samples are RGB images acquired from different scenes at different times, at different places and with different head scales; the pixel-level annotation is to label the samples: a pixel is marked on the head of each person in the image, one pixel representing one pedestrian, so that the sum of the annotated pixels is the total number of people; the Gaussian smoothing is to smooth the pixel-annotated image with a Gaussian kernel to generate the density map used for loss training, and the Gaussian kernel function for generating the density map is as follows:
F(x) = Σ_{i=1}^{N} δ(x − x_i) * G_{σ_i}(x),  with  σ_i = β·d̄_i  and  d̄_i = (1/m) Σ_{j=1}^{m} d_i^j
wherein δ is the impulse function, i and j index the pixel points, x_i is the position of the i-th head, δ(x − x_i) is the impulse response at the head position in the image, G_{σ_i} is the Gaussian kernel, N is the total number of heads in the image, and d̄_i is the average distance from x_i to its m nearest heads, which in a dense crowd is approximately equal to the head size; β = 0.3 is taken.
3. The crowd counting method based on a cross-scale pyramid Transformer according to claim 1, wherein the training sample image cropping is to crop the image to 224×224 and input it into the Patch Embedding module; the input image is passed through a convolution with kernel size 16×16, stride 16 and 768 convolution kernels, changing it from [224, 224, 3] to [14, 14, 768], and the height and width dimensions are then flattened into a one-dimensional vector [196, 768].
4. The crowd counting method based on a cross-scale pyramid Transformer according to claim 1, wherein the one-dimensional vector is input into the Transformer backbone network for feature extraction to obtain feature maps whose resolutions are 1/8, 1/16 and 1/32 of the input image resolution;
the Transformer backbone network comprises a pyramid Transformer formed by four stages, each stage comprising an ordinary convolution layer, a max pooling layer and a Transformer encoder module;
first, the one-dimensional vector is input to the encoder in the Transformer backbone network; the convolution kernel size of each stage is 7×7 with stride 2, the pooling kernel size of the pooling layer is 3×3 with stride 2, the numbers of output channels of the four stages are 128, 256, 512 and 1024, and the output feature map sizes of the four stages are respectively 1/4, 1/8, 1/16 and 1/32 of the original image size;
then, in the Transformer encoder module, self-attention is computed on the input over a number of blocks; the first block comprises a LayerNorm layer, a local multi-head self-attention layer, a LayerNorm layer and an MLP layer; the second block comprises a LayerNorm layer, a global multi-head self-attention layer, a LayerNorm layer and an MLP layer; the layers are connected by skip connections; the numbers of blocks in the four stages are 2, 2, 18 and 2 respectively, and the numbers of multi-head self-attention heads per stage are 4, 8, 16 and 32 respectively.
5. The crowd counting method based on a cross-scale pyramid Transformer according to claim 1, wherein the up-sampling is to up-sample the obtained feature maps with resolutions of 1/16 and 1/32 to the 1/8 size by bilinear interpolation and to change the feature map channels to 128 with 1×1 convolutions, so as to obtain three feature maps of the same resolution and dimension, where the bilinear interpolation is computed as:
f(x, y) ≈ (1 − a)(1 − b)·f(x_1, y_1) + a(1 − b)·f(x_2, y_2) + ab·f(x_3, y_3) + (1 − a)b·f(x_4, y_4)
wherein a = x − x_1 and b = y − y_1; f(x, y) is the target pixel value to be computed; for each pixel (x, y) in the new image, the four neighbouring pixels (x_1, y_1), (x_2, y_2), (x_3, y_3) and (x_4, y_4) are the four pixels of the original image closest to (x, y) along the x and y directions; the distance weights of each neighbouring pixel in the x and y directions are computed, and the values of the four neighbourhood pixels are weighted and averaged according to these weights.
6. The crowd counting method based on a cross-scale pyramid Transformer according to claim 1, wherein inputting the three branch feature maps into the multi-scale dilated convolution module means that the three obtained 1/8-resolution feature maps are input into the multi-scale dilated convolution module to obtain multi-scale feature maps carrying different information; the multi-scale dilated convolution module comprises four small branches: the first branch consists of a 1×1 convolution kernel with 384 output channels; the second branch consists of a 1×1 convolution kernel followed by a 3×3 convolution kernel with dilation rate 1, with 128 output channels; the third branch consists of a 3×3 convolution kernel followed by a 3×3 convolution kernel with dilation rate 2, with 128 output channels; the fourth branch consists of a 5×5 convolution kernel followed by a 3×3 convolution kernel with dilation rate 3, with 128 output channels; the outputs of the second, third and fourth branches are concatenated and then added to the first branch to output the enhanced feature map; the final output feature map has size 56×56 and 384 channels;
the pixel-by-pixel addition is to take the output three branch enhanced feature maps, which have the same resolution and number of channels, and add the three multi-scale feature maps pixel by pixel on the corresponding channels; each branch feature map has size 56×56 with 384 channels, and the summed output feature map also has size 56×56 with 384 channels.
7. The crowd counting method based on the cross-scale pyramid Transformer according to claim 1, wherein the mean square error loss auxiliary training is to regress the three output branch feature maps against the real density map as auxiliary supervision to optimise the model parameters, with the weight coefficients of the auxiliary loss functions at different depths set to 0.1, 0.2 and 0.3, respectively;
the mean square error loss function measures the distance between the predicted value and the true value; the smaller its value, the better the model fits the data. The specific calculation formula is as follows:
L2 = (1/n) · Σ(i=1..n) (yi − ŷi)²
where yi represents the true value of a sample, ŷi represents the predicted value of the model, and n represents the number of samples; the meaning of the equation is to divide the sum of squares of the prediction errors over all data points by the number of samples n.
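A small sketch of the formula above and of the depth-weighted auxiliary supervision from this claim (weights 0.1, 0.2 and 0.3); the intermediate density predictions are stood in by random tensors, so all variable names are illustrative.

```python
import torch

def mse_loss(pred, target):
    """Mean squared error: sum of squared prediction errors divided by the number of samples."""
    return ((pred - target) ** 2).mean()

# Auxiliary supervision: intermediate density predictions from the three branches,
# weighted by depth as in the claim (0.1, 0.2, 0.3). aux_preds and gt_density are illustrative tensors.
aux_weights = (0.1, 0.2, 0.3)
gt_density = torch.rand(1, 1, 56, 56)
aux_preds = [torch.rand(1, 1, 56, 56, requires_grad=True) for _ in range(3)]
aux_loss = sum(w * mse_loss(p, gt_density) for w, p in zip(aux_weights, aux_preds))
```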
8. The crowd counting method based on the cross-scale pyramid Transformer according to claim 1, wherein the feature map processing is to input the output aggregated feature map into a pyramid average pooling module; the pyramid average pooling module has four levels with kernel sizes of 2×2, 3×3, 5×5 and 7×7, respectively, each branch further comprising a 1×1 convolution kernel and up-sampling so that the outputs keep the same resolution and dimension; the four branch feature maps are concatenated with the output aggregated feature map, and the final regression feature map is output with a size of 56×56 and 384 channels.
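The following is a hedged PyTorch sketch of the four-level pyramid average pooling module described in this claim, assuming the 2×2, 3×3, 5×5 and 7×7 kernels are average-pooling kernels and that a final 1×1 fusion convolution returns the concatenated map to 384 channels; the per-level channel width (96) is likewise an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidAvgPooling(nn.Module):
    """Four-level pyramid; each level: average pooling -> 1x1 conv -> bilinear upsampling back."""
    def __init__(self, channels=384, level_ch=96, kernels=(2, 3, 5, 7)):
        super().__init__()
        self.kernels = kernels
        self.convs = nn.ModuleList(nn.Conv2d(channels, level_ch, 1) for _ in kernels)
        # Assumed fusion conv so the concatenated map returns to `channels` (384) channels.
        self.fuse = nn.Conv2d(channels + level_ch * len(kernels), channels, 1)

    def forward(self, x):                              # x: (B, 384, 56, 56) aggregated feature map
        h, w = x.shape[-2:]
        levels = [F.interpolate(conv(F.avg_pool2d(x, k)), size=(h, w),
                                mode='bilinear', align_corners=False)
                  for k, conv in zip(self.kernels, self.convs)]
        return self.fuse(torch.cat([x] + levels, dim=1))   # concatenate with the aggregated map
```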
9. The crowd counting method based on the cross-scale pyramid Transformer according to claim 1, wherein the convolution regression is that the output regression feature map is passed through a 1×1 convolution for smooth dimensionality reduction and regression to obtain the predicted density map; the convolution layer is followed by batch normalization and ReLU activation, the crowd density prediction map is finally output, and the crowd counting result is obtained by summing the predicted density map.
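A minimal sketch of the regression head in this claim: a 1×1 convolution followed by batch normalization and ReLU reduces the regression feature map to a single-channel density map, and the crowd count is obtained by summing that map; the exact ordering of the layers is an assumption.

```python
import torch
import torch.nn as nn

# 1x1 convolution regression head (channel counts taken from the claims; layer order is a sketch).
head = nn.Sequential(
    nn.Conv2d(384, 1, kernel_size=1),   # smooth dimensionality reduction to a 1-channel density map
    nn.BatchNorm2d(1),
    nn.ReLU(inplace=True),
)

regression_map = torch.randn(1, 384, 56, 56)      # output regression feature map
density = head(regression_map)                    # predicted crowd density map
count = density.sum().item()                      # crowd count = sum over the density map
```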
10. The crowd counting method based on the cross-scale pyramid Transformer according to claim 1, wherein the weighted sum of the counting loss, the optimal transport loss and the mean square error loss is used as the loss function, with the calculation formula:
L(Z, Ẑ) = | ‖Z‖1 − ‖Ẑ‖1 | + λ1·LOT(Z, Ẑ) + λ2·L2(Z, Ẑ)
wherein Z and Ẑ represent the predicted density map and the true value, ‖·‖1 represents the L1 norm, LOT represents the optimal transport loss, L2 represents the mean square error loss, and λ1 and λ2 are loss coefficients set to 0.01 and 0.1, respectively;
the counting loss is defined as the absolute difference between the true count and the predicted count, calculated with the L1 norm; the mean square error loss is a loss function used to calculate the difference between the model's predicted value and the actual value, squaring the error so that it becomes a positive value representing the magnitude of the model's prediction error; the calculation formula of the optimal transport loss is as follows:
OT(p, q) = min_{γ ∈ Γ(p,q)} ∫ c(x, y) dγ(x, y)
where p and q are two probability distributions, c(x, y) is a cost function defined on (x, y), and Γ(p, q) is the set of all joint distributions whose marginals are p and q; minimizing the above expression finds the optimal mapping between p and q.
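As an illustration of how the three terms could be combined, the sketch below uses the counting loss (absolute difference of the summed density maps), an entropy-regularised Sinkhorn iteration as a generic stand-in for the optimal transport term, and the mean square error, weighted by λ1 = 0.01 and λ2 = 0.1; the Sinkhorn routine and the cost-matrix construction are assumptions, not the patent's exact OT solver.

```python
# Hedged sketch of the combined loss: count loss + lambda1 * OT loss + lambda2 * MSE loss.
# The Sinkhorn iteration is a generic entropy-regularised stand-in for the optimal transport term.
import torch

def sinkhorn_ot(p, q, cost, eps=0.1, iters=50):
    """Approximate OT cost between histograms p (n,) and q (m,) under an (n, m) cost matrix."""
    K = torch.exp(-cost / eps)
    u = torch.ones_like(p)
    for _ in range(iters):                                   # Sinkhorn fixed-point updates
        v = q / (K.t() @ u + 1e-9)
        u = p / (K @ v + 1e-9)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)               # approximate transport plan gamma
    return (plan * cost).sum()                               # corresponds to the integral of c dgamma

def combined_loss(pred, gt, cost, lam1=0.01, lam2=0.1):
    count_loss = (pred.sum() - gt.sum()).abs()               # |count(Z) - count(Z_hat)|, an L1 difference
    p = (pred / (pred.sum() + 1e-9)).flatten()               # normalise density maps to distributions
    q = (gt / (gt.sum() + 1e-9)).flatten()
    ot_loss = sinkhorn_ot(p, q, cost)
    mse_loss = ((pred - gt) ** 2).mean()
    return count_loss + lam1 * ot_loss + lam2 * mse_loss

# cost is assumed to be the pairwise spatial distance between pixel locations of the two maps.
```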
CN202310700559.4A 2023-06-14 2023-06-14 Crowd counting method based on trans-scale pyramid convertors Pending CN116740439A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310700559.4A CN116740439A (en) 2023-06-14 2023-06-14 Crowd counting method based on trans-scale pyramid convertors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310700559.4A CN116740439A (en) 2023-06-14 2023-06-14 Crowd counting method based on trans-scale pyramid convertors

Publications (1)

Publication Number Publication Date
CN116740439A true CN116740439A (en) 2023-09-12

Family

ID=87902334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310700559.4A Pending CN116740439A (en) 2023-06-14 2023-06-14 Crowd counting method based on trans-scale pyramid convertors

Country Status (1)

Country Link
CN (1) CN116740439A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115723A (en) * 2023-10-23 2023-11-24 四川泓宝润业工程技术有限公司 Fire-fighting facility counting method and device, storage medium and electronic equipment
CN117115723B (en) * 2023-10-23 2024-01-23 四川泓宝润业工程技术有限公司 Fire-fighting facility counting method and device, storage medium and electronic equipment
CN117409299A (en) * 2023-12-15 2024-01-16 武汉纺织大学 Image internal shielding relation prediction method based on multi-scale pooling convertors
CN117409299B (en) * 2023-12-15 2024-03-05 武汉纺织大学 Image internal shielding relation prediction method based on multi-scale pooling convertors

Similar Documents

Publication Publication Date Title
CN107967451B (en) Method for counting crowd of still image
CN109949255B (en) Image reconstruction method and device
CN111242036B (en) Crowd counting method based on multi-scale convolutional neural network of encoding-decoding structure
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
CN108615027B (en) Method for counting video crowd based on long-term and short-term memory-weighted neural network
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN108764085B (en) Crowd counting method based on generation of confrontation network
CN111950649B (en) Attention mechanism and capsule network-based low-illumination image classification method
CN112288011B (en) Image matching method based on self-attention deep neural network
CN116740439A (en) Crowd counting method based on trans-scale pyramid convertors
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
US11367206B2 (en) Edge-guided ranking loss for monocular depth prediction
Hu et al. RGB-D image multi-target detection method based on 3D DSF R-CNN
Qu et al. Visual cross-image fusion using deep neural networks for image edge detection
CN115131760A (en) Lightweight vehicle tracking method based on improved feature matching strategy
CN115565043A (en) Method for detecting target by combining multiple characteristic features and target prediction method
Hu et al. Supervised multi-scale attention-guided ship detection in optical remote sensing images
CN109635726B (en) Landslide identification method based on combination of symmetric deep network and multi-scale pooling
Wu et al. Vehicle detection based on adaptive multi-modal feature fusion and cross-modal vehicle index using RGB-T images
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN113887536B (en) Multi-stage efficient crowd density estimation method based on high-level semantic guidance
CN116452793A (en) Multi-view and multi-level-based green coding and decoding significant target detection method
Li et al. A new algorithm of vehicle license plate location based on convolutional neural network
CN115953736A (en) Crowd density estimation method based on video monitoring and deep neural network
CN112632601B (en) Crowd counting method for subway carriage scene

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination