CN115457464B - Crowd counting method based on transformer and CNN - Google Patents
- Publication number: CN115457464B
- Application number: CN202211084706.1A
- Authority: CN (China)
- Prior art keywords: layer, crowd, feature, convolution, scale
- Prior art date: 2022-09-06
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06V10/42—Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
- G06V10/82—Image or video recognition or understanding using pattern recognition or machine learning with neural networks
- G06T2207/10024—Color image
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- G06T2207/30242—Counting objects in image
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Abstract
The invention discloses a crowd counting method based on a transformer and a CNN, comprising the following steps: obtaining training samples and applying preprocessing and augmentation; inputting the augmented RGB images into the model backbone network to obtain global feature maps at different resolutions; upsampling the global feature maps at different resolutions and concatenating them in the channel dimension to obtain an aggregated feature map; inputting the aggregated feature map into a multi-branch convolutional neural network to obtain multi-scale feature maps, which are summed in the channel dimension to obtain a multi-scale aggregated feature map; inputting the multi-scale aggregated feature map into a density map regression layer for smoothing and dimension reduction to output a density map; training with an optimal transport loss, and finally performing prediction. Combining a pyramid transformer with a multi-branch convolutional neural network enlarges the receptive field of the model, effectively reduces the influence of scale variability, and improves prediction accuracy.
Description
Technical Field
The invention relates to a crowd counting method based on a transformer and a CNN, and belongs to the field of image processing.
Background
With the rapid development of society and the continuous improvement of living standards, crowd-gathering scenes are increasingly common, especially in public places such as large venues, transportation hubs, and shopping malls, and the safety requirements of these scenes keep rising. Knowing the number of people in such scenes for safety management and emergency evacuation has therefore received extensive attention from researchers. Crowd density estimation estimates the density of the crowd in a given scene, generates a density map carrying distribution information, and gives the total number of people.
At present, crowd density estimation mainly uses three kinds of methods: detection-based, regression-based, and density map-based. Detection-based methods count pedestrian targets with manually designed window detectors and perform poorly in crowd-dense areas. Regression-based methods regress the head count directly and lack crowd spatial information. Compared with the first two, density map-based methods output a crowd density map alongside the total count, providing key information about the spatial distribution of the crowd; they adapt reasonably well to dense areas, increase counting accuracy, and reduce the difficulty of method design. Current crowd density estimation methods are therefore mainly density map-based.
Most existing density map-based crowd density estimation methods use a convolutional neural network model from deep learning for density map regression. However, crowd targets suffer from severe scale variability, and the receptive field of a convolutional neural network model is limited, so multi-scale features cannot be captured effectively, which ultimately reduces the accuracy of the counting model. How to increase the receptive field of the model and reduce the influence of scale variability on counting accuracy has therefore become a difficult problem to be solved.
Disclosure of Invention
Aiming at the problems of variable scale in crowd scenes and the limited receptive field of convolutional neural network models, the invention provides a crowd counting method based on a transformer and a CNN. A pyramid transformer is combined with a convolutional neural network: the pyramid transformer learns global features of the image, giving the model a global receptive field and reducing the influence of scale variability on accuracy; meanwhile, multi-scale feature learning and enhancement are performed through a multi-branch convolutional neural network, enriching the model's feature representation and providing locality and inductive bias, so that density map regression is finally performed accurately.
In order to solve the above technical problems, the invention adopts the following technical scheme:
A crowd counting method based on a transformer and a CNN comprises the following steps:
(1) Obtain training samples: acquire a large number of crowd RGB images in multiple scenes, then acquire crowd annotations by marking a pixel point at each head position, the number of pixel points representing the total number of people in the scene. The images are then augmented, randomly flipped horizontally or vertically, and standardized, and the training images are cropped to 256×256 for training.
(2) Input the augmented crowd RGB images into the backbone network of the model to compute global feature maps at different resolutions. The backbone network is a pyramid transformer composed of four stages, each comprising an overlapping image block embedding layer and an encoder.
Further, in the overlapping image block embedding layer, the input image is divided into mutually overlapping image blocks by one convolution layer; the convolution operation outputs a two-dimensional feature map, which is unfolded into a one-dimensional vector and regularized as the encoder input. The convolution kernel of the first stage's overlapping image block embedding layer is 7×7 with a stride of 4; the convolution kernels of the other three stages' overlapping image block embedding layers are 3×3 with a stride of 2. The output dimensions of the four convolution layers are 64, 128, 320, and 512. Pyramid-style feature maps at different resolutions are output by controlling the stride of the convolution layers.
Further, in the encoder, the input vector undergoes self-attention calculation through a number of blocks, each comprising a self-attention calculation layer and a forward propagation layer connected by skip connections. The numbers of blocks in the four stages are 3, 8, 27, and 3; the numbers of heads of the multi-head self-attention layers in the four stages are 1, 2, 5, and 8 respectively. The vector computed by the encoder is reshaped into a two-dimensional feature map and serves as the input of the next stage. Finally, the four stages output four groups of global feature maps at different resolutions, which are 1/4, 1/8, 1/16, and 1/32 of the input augmented crowd RGB image resolution in sequence.
(3) First, upsample the four groups of global feature maps at different resolutions extracted in step (2) to the same resolution while keeping the channel counts unchanged: the feature maps of the last three stages are upsampled by bilinear interpolation to the resolution of the first-stage feature map, i.e. 1/4 the size of the augmented crowd RGB image.
The feature maps of the four stages are then aggregated by concatenating all of them in the channel dimension; the total channel count is the sum of the four stages' channel counts, i.e. 64+128+320+512=1024, finally giving an aggregated feature map with 1024 channels.
(4) Input the aggregated feature map into the multi-branch convolutional neural network module of the network model to obtain multi-scale feature maps. The module comprises three branches, each containing a convolution layer; the first branch's convolution kernel is 3×3, the second's is 5×5, and the third's is 7×7. Each branch outputs 256 channels, and each branch's convolution layer is followed by a batch regularization layer and a ReLU activation function layer.
After calculation by the three branches, three groups of multi-scale feature maps with the same resolution and channel count are obtained. The multi-scale feature maps are then summed pixel-wise on corresponding channels, specifically by adding the pixels at corresponding positions of the three feature maps on each channel, finally giving a multi-scale aggregated feature map with 256 channels.
(5) Input the multi-scale aggregated feature map output in step (4) into the density map regression layer for smoothing and dimension reduction, and output the density map. The density map regression layer comprises two convolution layers: the first has a 3×3 kernel, a stride of 1, and 64 output channels; the second has a 1×1 kernel, a stride of 1, and 1 output channel. Each convolution layer is followed by a batch regularization layer and a ReLU activation function, and finally the crowd density estimation map and the crowd counting result are output.
(6) Train with the optimal transport loss, regressing the crowd density estimation map and the total count, optimize the model parameters, and save the model parameters with minimum loss; at prediction time, load the saved minimum-loss model parameters and directly obtain the crowd density estimation map and the crowd counting result as the prediction results.
By adopting the above technical scheme, the invention achieves the following technical progress:
1. The pyramid backbone network outputs feature maps at different resolutions, so that crowd targets at different scales have rich feature representations: high-resolution feature maps carry rich detail, which benefits the prediction of small-scale crowds, while low-resolution feature maps carry rich semantic information, which benefits the prediction of large-scale crowds. Aggregating feature maps at different resolutions improves the accuracy of crowd density map estimation.
2. The transformer used has a global receptive field and can model with all input pixels. Taking the transformer as the model backbone increases the receptive field of the model, overcomes the drawback that a traditional convolutional neural network model has only a local receptive field, effectively reduces the influence of scale variability in crowd scenes on the model, and improves accuracy.
3. A multi-branch convolutional neural network is adopted to learn the global feature maps with convolution kernels of different sizes, enriching the detail representations of features at different scales in the feature maps, further reducing the influence of scale variability, and finally completing the density map regression.
4. The method combines the transformer with the convolutional neural network so that the model has both a global receptive field and locality, avoiding the problem that a transformer alone, lacking the inductive bias of visual tasks, generalizes poorly and requires a large amount of training data, while also solving the problem of the convolutional neural network's limited receptive field; the method therefore adapts well to scenes with variable scales.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the overall network architecture of the present invention;
FIG. 3 is a schematic illustration of training of the present invention;
FIG. 4 is a schematic diagram of crowd density estimation using the present invention.
Detailed Description
The invention is further illustrated by the following examples:
FIG. 1 is a flow chart of the crowd density estimation method of the invention. As shown in FIG. 1, the method comprises the following steps:
(1) Training samples of crowd RGB images are acquired in large numbers from multiple scenes. At the same time, annotation data for the corresponding crowd RGB images are obtained by marking a pixel point on each target head: one marked pixel point represents one pedestrian, the sum of the pixel points is the total number of people, and the horizontal and vertical coordinates of the marked points are finally obtained. The training samples are then augmented with random horizontal and vertical flipping, and standardized. At training time, images are randomly cropped to 256×256 image blocks for training.
The crowd RGB image consists of three channels, R (red), G (green), and B (blue). The image is standardized in the channel dimension by constraining the mean and standard deviation of the three channels, giving the augmented crowd RGB image. The standardization is:

C' = (C - mean) / std (1)

where C is the channel data, C' is the standardized channel data, mean is the channel mean, and std is the channel standard deviation. The augmented crowd RGB image is then input to the pyramid transformer backbone of the model as the prediction input.
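As an illustration of this preprocessing step, the following is a minimal sketch using PyTorch/torchvision. The mean and standard deviation values are the commonly used ImageNet statistics and are an assumption here, since the description only specifies per-channel standardization; in practice the annotation points must also be transformed with the same flips and crop as the image.

```python
import torchvision.transforms as T

# Minimal preprocessing sketch for step (1). The mean/std values are the
# common ImageNet statistics (an assumption; the patent only specifies
# per-channel standardization C' = (C - mean) / std).
train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),   # random horizontal flip
    T.RandomVerticalFlip(p=0.5),     # random vertical flip
    T.RandomCrop(256),               # crop training images to 256x256
    T.ToTensor(),                    # HWC uint8 -> CHW float in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
```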
(2) FIG. 2 is a schematic diagram of the overall network structure of the invention, according to which the network is built. The pyramid transformer backbone consists of four stages; the augmented crowd RGB image passes through the four stages of the backbone in sequence to obtain feature maps at different resolutions, the output feature map resolution of each stage being 1/4, 1/8, 1/16, and 1/32 of the augmented crowd RGB image resolution in sequence.
Each stage of the backbone network includes an overlapping image block embedding layer and an encoder. In each stage, the input feature map first passes through the image block embedding layer, which maps the two-dimensional feature map into one-dimensional vectors; the one-dimensional vectors undergo self-attention calculation in the encoder and are reshaped into a two-dimensional global feature map, which serves as the input of the next stage.
The structure and steps of each stage are described in detail as follows:
The overlapping image block embedding layer of the first stage comprises a convolution layer and a regularization layer. Specifically, the convolution kernel of the convolution layer is 7×7 with a stride of 4, 3 input channels, and 64 output channels. Because the stride is greater than 1, the resolution of the augmented crowd RGB image is reduced to 1/4 of the input after the convolution layer; because the stride is smaller than the kernel size, the convolution kernel overlaps adjacent image blocks, increasing local information interaction. The output two-dimensional feature map is then flattened into a one-dimensional vector, preserving all pixel information while meeting the computation requirements. The flattened vector is input to a regularization layer and regularized in the channel dimension, which aids model convergence, and the regularized vector is input to the encoder (a sketch of this embedding layer is given below). The first-stage encoder comprises 3 blocks, each consisting of a multi-head self-attention layer and a forward propagation layer, connected by skip connections. The self-attention calculation layer comprises a regularization layer, which regularizes the input vector in the channel dimension, and a multi-head self-attention layer. The first-stage multi-head self-attention layer has 1 head and an input vector dimension of 64; the computed vector is input to the forward propagation layer.
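A minimal PyTorch sketch of this overlapping image block embedding layer follows; the class and argument names are illustrative, not the patented implementation:

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Overlapping patch embedding sketch: a strided convolution whose kernel
    is larger than its stride, so neighbouring image blocks overlap, followed
    by LayerNorm over the channel dimension (stage 1: 7x7 kernel, stride 4)."""
    def __init__(self, in_ch=3, embed_dim=64, kernel=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=kernel,
                              stride=stride, padding=kernel // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x)                    # (B, C, H/4, W/4)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)    # flatten to a token sequence (B, HW, C)
        x = self.norm(x)                    # regularize in the channel dimension
        return x, H, W

# Usage: a 256x256 crop yields 64x64 = 4096 tokens of dimension 64
tokens, H, W = OverlapPatchEmbed()(torch.randn(1, 3, 256, 256))
```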
The self-attention calculation is:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V (2)

where Q, K, and V denote the query matrix, key matrix, and value matrix, obtained by multiplying the input vector with the weight matrices W_Q, W_K, and W_V respectively, and d_k is the dimension of the input vector of the multi-head self-attention calculation.
The softmax is calculated as:

softmax(x_i) = exp(x_i) / Σ_j exp(x_j) (3)
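Equations (2) and (3) can be illustrated with the following single-head sketch (the model itself uses multi-head self-attention; the names are illustrative):

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head version of Eq. (2)-(3): Q, K, V are obtained by multiplying
    the input tokens with learned weight matrices, and the softmax-normalized
    similarity Q K^T / sqrt(d_k) weights the value vectors."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v               # (B, N, d_k) each
    d_k = q.size(-1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return attn @ v                                   # (B, N, d_k)

# Usage: 4096 tokens of dimension 64, as produced by the stage-1 embedding
x = torch.randn(1, 4096, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                # torch.Size([1, 4096, 64])
```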
the forward propagation layer consists of two fully connected layers and one deep convolutional layer. The output dimension of the first full-connection layer is 8 times of the input dimension, the output dimension of the second full-connection layer is the dimension of the input vector of the forward propagation layer, a deep convolution layer is arranged between the two full-connection layers, the deep convolution layer firstly reshapes the vector into a two-dimensional feature map, then the two-dimensional feature map is divided into groups with the same dimension as the channel dimension, independent convolution is carried out on each group to be used as position coding of the feature map, the convolution kernel size is 3 multiplied by 3, the step size is 1, the input channel and the output channel are the same, and the number of channels of the input vector is the same. After the calculation is completed, the two-dimensional feature map is flattened into a vector to be used as the output of the second full connection layer. After the first full connection layer and the deep convolution layer, the full connection layer and the deep convolution layer are activated through a GELU activation function, wherein the activation function is calculated in the following way:
GELU(x)=xP(X≤x) (4)
where x is the input vector and P is the cumulative distribution of the gaussian distribution. The vector passing through the forward propagation layer is then reshaped into a two-dimensional feature map as input to the second stage.
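A minimal sketch of this forward propagation layer, assuming the stage-1 dimensions (64-dimensional tokens, an expansion ratio of 8); setting groups equal to the channel count makes the 3×3 convolution depth-wise, i.e. each channel is convolved independently:

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    """Forward propagation layer sketch: FC -> depth-wise 3x3 convolution
    (acting as a positional encoding) -> FC, with GELU after the first FC
    and after the depth-wise convolution, matching the description above."""
    def __init__(self, dim=64, ratio=8):
        super().__init__()
        hidden = dim * ratio
        self.fc1 = nn.Linear(dim, hidden)
        # groups=hidden => one filter per channel (depth-wise convolution)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):
        x = self.act(self.fc1(x))                     # (B, N, hidden), GELU after fc1
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)     # reshape tokens to a 2-D map
        x = self.act(self.dwconv(x))                  # GELU after the depth-wise conv
        x = x.flatten(2).transpose(1, 2)              # flatten back to tokens
        return self.fc2(x)                            # (B, N, dim)
```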
Further, the second, third, and fourth stages have structures similar to the first stage, differing only in certain layers. In the second, third, and fourth stages, the convolution kernel of the convolution layer in the overlapping image block embedding layer is 3×3 with a stride of 2, and the output dimensions are 128, 320, and 512 in sequence; the numbers of blocks in the encoders are 8, 27, and 3 in sequence; the numbers of heads of the multi-head self-attention layers are 2, 5, and 8 respectively; and the output channel count of the first fully connected layer in the forward propagation layer is 8 times, 4 times, and 4 times the input channel count in sequence. The outputs of the first, second, and third stages each serve as the input of the following stage. Finally, the four stages output four groups of two-dimensional global feature maps at different resolutions.
(3) The global feature maps of the four stages are upsampled by bilinear interpolation: the second-stage output feature map is upsampled by a factor of 2, the third-stage by a factor of 4, and the fourth-stage by a factor of 8, while the first-stage output feature map keeps its size.

Further, once upsampled to the same size, the four stages' feature maps are concatenated in the channel dimension to obtain the aggregated feature map, whose channel count is the sum of the four stages' channel counts and whose resolution is 1/4 that of the augmented crowd RGB image.
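A sketch of this upsample-and-concatenate aggregation under the stage dimensions given above (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def aggregate(features):
    """Step (3) sketch: bilinearly upsample the stage-2/3/4 maps (x2, x4, x8)
    to the stage-1 resolution and concatenate along the channel dimension,
    giving 64 + 128 + 320 + 512 = 1024 channels."""
    target = features[0].shape[-2:]                    # stage-1 spatial size
    ups = [features[0]] + [
        F.interpolate(f, size=target, mode='bilinear', align_corners=False)
        for f in features[1:]
    ]
    return torch.cat(ups, dim=1)                       # (B, 1024, H/4, W/4)

# Usage with a 256x256 input: stage maps at 1/4, 1/8, 1/16, 1/32 resolution
feats = [torch.randn(1, c, 256 // s, 256 // s)
         for c, s in zip((64, 128, 320, 512), (4, 8, 16, 32))]
agg = aggregate(feats)   # torch.Size([1, 1024, 64, 64])
```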
(4) The aggregated feature map is input into the multi-branch convolutional neural network module to obtain multi-scale feature maps. The module consists of three branches, each comprising a convolution layer, a regularization layer, and a ReLU activation function. Specifically, the convolution kernel sizes of the convolution layers in the three branches are 3×3, 5×5, and 7×7 in sequence, and the number of output channels is 256. The ReLU activation function is calculated as:

ReLU(x) = max(0, x) (5)
further, after the aggregation feature map passes through three branches, three groups of multi-scale feature maps with the same resolution and channel number are output, then the multi-scale feature maps are added according to pixel positions in the channel dimension, specifically, on channels corresponding to the three groups of feature maps, pixels at the same positions are added, the output resolution is 1/4 of the enhanced crowd RGB image, the output channel is 256, and the multi-scale aggregation feature map is obtained.
(5) The multi-scale aggregated feature map is smoothed and reduced in dimension, and finally the crowd density estimation map and the crowd counting result are output. Smoothing and dimension reduction are performed by two convolution layers, each followed by a regularization layer and a ReLU activation function. The first convolution layer has a 3×3 kernel and 64 output channels; the second has a 1×1 kernel and 1 output channel. The smoothed, dimension-reduced feature map is the final crowd density estimation map, and summing all pixels of the density map yields the final crowd counting result.
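The density map regression layer and the final count can be sketched as follows (illustrative, assuming a 256-channel multi-scale aggregated feature map at 1/4 resolution):

```python
import torch
import torch.nn as nn

# Step (5) sketch: two convolutions (3x3 -> 64 channels, then 1x1 -> 1 channel),
# each followed by BatchNorm and ReLU, producing a single-channel density map.
regression_head = nn.Sequential(
    nn.Conv2d(256, 64, 3, stride=1, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 1, 1, stride=1), nn.BatchNorm2d(1), nn.ReLU(inplace=True),
)

density_map = regression_head(torch.randn(1, 256, 64, 64))
count = density_map.sum()   # summing all pixels gives the estimated head count
```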
(6) FIG. 3 is a schematic diagram of training the crowd density estimation method of the invention. The network is first trained; the loss function adopted by the invention is the optimal transport loss, as follows:

L = L_1 + L_OT (6)

L_1 = | ||S||_1 - ||S'||_1 | (7)

where L is the overall loss function of the network model; L_1 is the absolute error of the total count, which optimizes the people-counting result; L_OT is the optimal transport loss, which optimizes the distribution of the crowd density estimation map; S is the crowd annotation point map and ||S||_1 is its 1-norm; S' is the crowd density estimation map and ||S'||_1 is its 1-norm; and Φ denotes the Wasserstein distance computation from which L_OT is obtained.
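A sketch of the combined loss of Eq. (6)-(7); the optimal transport term is represented by a placeholder function, since the description names the Wasserstein distance Φ but a concrete OT solver (e.g. a Sinkhorn iteration) is an implementation choice not specified in this text:

```python
import torch

def counting_loss(pred_density, gt_point_map, ot_loss_fn):
    """Sketch of Eq. (6)-(7): total loss = absolute count error + OT term.
    `ot_loss_fn` is a placeholder for a Wasserstein/optimal transport
    implementation; the patent does not specify a particular solver."""
    l1 = (pred_density.sum() - gt_point_map.sum()).abs()   # | ||S'||_1 - ||S||_1 |
    return l1 + ot_loss_fn(pred_density, gt_point_map)
```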
During training, model parameters are optimized by back-propagation with gradient descent; a target threshold is set, and the model parameters with the minimum loss are saved, completing model training. FIG. 4 is a schematic diagram of crowd density estimation using the crowd density estimation method of the invention. At prediction time, the image to be predicted is input directly into the network without augmentation, and the crowd density estimation map and the total count are obtained as the final prediction results.
The above examples are only illustrative of preferred embodiments of the invention and are not intended to limit its scope; various modifications and improvements made by those skilled in the art to the technical solution of the invention without departing from its spirit shall fall within the scope of protection defined by the claims.
Claims (6)
1. A crowd counting method based on a transformer and a CNN, characterized in that it comprises the following steps:
(1) obtaining training samples: acquiring crowd RGB images of multiple scenes, and preprocessing and augmenting the crowd RGB images;
(2) inputting the augmented crowd RGB images into the backbone network of the model for calculation, the backbone network comprising a pyramid transformer composed of four stages, the augmented crowd RGB images passing through the four stages of the backbone network in sequence to obtain global feature maps at different resolutions, wherein each stage includes an overlapping image block embedding layer and an encoder;
(3) upsampling the global feature maps at different resolutions and then concatenating them in the channel dimension to obtain an aggregated feature map;
wherein step (3) specifically comprises:
first upsampling the four groups of global feature maps at different resolutions extracted in step (2) to the same resolution while keeping the channel counts unchanged, the feature maps of the last three stages being upsampled by bilinear interpolation to the resolution of the first-stage feature map, i.e. 1/4 the size of the augmented crowd RGB image;
then aggregating the feature maps of the four stages by concatenating all of them in the channel dimension, the total channel count being the sum of the channel counts of the four stages' feature maps, i.e. 64+128+320+512=1024, finally obtaining an aggregated feature map with 1024 channels;
(4) inputting the aggregated feature map into a multi-branch convolutional neural network to obtain multi-scale feature maps, and summing the multi-scale feature maps in the channel dimension to obtain a multi-scale aggregated feature map;
wherein step (4) specifically comprises:
the multi-branch convolutional neural network module comprises three branches, each comprising a convolution layer, with a 3×3 convolution kernel in the first branch, 5×5 in the second branch, and 7×7 in the third branch; each branch outputs 256 channels, and the convolution layer of each branch is followed by a batch regularization layer and a ReLU activation function layer;
after calculation by the three branches, three groups of multi-scale feature maps with the same resolution and channel count are obtained; the multi-scale feature maps are then summed pixel by pixel on corresponding channels, specifically by adding the pixels at corresponding positions of the three feature maps on each corresponding channel, finally obtaining the multi-scale aggregated feature map with 256 channels;
(5) inputting the multi-scale aggregated feature map into a density map regression layer for smoothing and dimension reduction, and outputting a density map;
(6) training with the optimal transport loss, and finally performing prediction.
2. The transformer and CNN based crowd counting method of claim 1, wherein: in step (1), before the preprocessing and augmentation, annotation data of the crowd RGB images are acquired, a pixel point being annotated at each head position, the number of pixel points representing the total number of people in the scene.
3. The transformer and CNN based crowd counting method of claim 1, wherein: the preprocessing and augmentation in step (1) specifically comprise random horizontal or vertical flipping and standardization, and during training the training images are cropped into 256×256 image blocks for training.
4. The transformer and CNN based crowd counting method of claim 1, wherein step (2) specifically comprises:
in the overlapping image block embedding layer, the input image is divided into mutually overlapping image blocks by one convolution layer, a two-dimensional feature map is output by the convolution operation, and the output two-dimensional feature map is unfolded into a one-dimensional vector and regularized as the encoder input; the convolution kernel of the convolution layer of the first stage's overlapping image block embedding layer is 7×7 with a stride of 4; the convolution kernels of the convolution layers in the other three stages' overlapping image block embedding layers are 3×3 with a stride of 2; the output dimensions of the four convolution layers are 64, 128, 320, and 512 in sequence; pyramid-style feature maps at different resolutions are output by controlling the stride of the convolution layers;
in the encoder, the input vector undergoes self-attention calculation through a number of blocks, each block comprising a self-attention calculation layer and a forward propagation layer connected by skip connections; the numbers of blocks in the four stages are 3, 8, 27, and 3 in sequence; the numbers of heads of the multi-head self-attention layers in the four stages are 1, 2, 5, and 8 respectively; the vector computed by the encoder is reshaped into a two-dimensional feature map and serves as the input of the next stage; finally the four stages output four groups of global feature maps at different resolutions, which are 1/4, 1/8, 1/16, and 1/32 of the input augmented crowd RGB image resolution in sequence.
5. The transformer and CNN based crowd counting method of claim 1, wherein step (5) specifically comprises:
the density map regression layer comprises two convolution layers: the first convolution layer has a 3×3 kernel, a stride of 1, and 64 output channels; the second convolution layer has a 1×1 kernel, a stride of 1, and 1 output channel; each convolution layer is followed by a batch regularization layer and a ReLU activation function; finally the crowd density estimation map and the crowd counting result are output.
6. The transformer and CNN based crowd counting method of claim 1, wherein step (6) specifically comprises:
training with the optimal transport loss, regressing the crowd density estimation map and the total count, optimizing the model parameters, and saving the model parameters with minimum loss; at prediction time, loading the saved minimum-loss model parameters and directly obtaining the crowd density estimation map and the crowd counting result as the prediction results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211084706.1A CN115457464B (en) | 2022-09-06 | 2022-09-06 | Crowd counting method based on transformer and CNN |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115457464A CN115457464A (en) | 2022-12-09 |
CN115457464B true CN115457464B (en) | 2023-11-10 |
Family
ID=84302181
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211084706.1A Active CN115457464B (en) | 2022-09-06 | 2022-09-06 | Crowd counting method based on transformer and CNN |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115457464B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115861930B (en) * | 2022-12-13 | 2024-02-06 | 南京信息工程大学 | Crowd counting network modeling method based on hierarchical difference feature aggregation |
CN117952869B (en) * | 2024-03-27 | 2024-06-18 | 西南石油大学 | Drilling fluid rock debris counting method based on weak light image enhancement |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109271960A (en) * | 2018-10-08 | 2019-01-25 | 燕山大学 | A kind of demographic method based on convolutional neural networks |
CN113537393A (en) * | 2021-08-09 | 2021-10-22 | 南通大学 | Dark scene three-dimensional human body posture estimation algorithm based on improved Transformer |
CN114821357A (en) * | 2022-04-24 | 2022-07-29 | 中国人民解放军空军工程大学 | Optical remote sensing target detection method based on transformer |
Non-Patent Citations (5)
Title |
---|
CCTrans: Simplifying and Improving Crowd Counting with Transformer; Tian Ye et al.; arXiv; pp. 1-11 *
CrowdFormer: An Overlap Patching Vision Transformer for Top-Down Crowd Counting; Yang Shangpeng et al.; Proceedings of the International Joint Conference on Artificial Intelligence; pp. 1545-1551 *
CrowdFormer: Weakly-Supervised Crowd Counting with Improved Generalizability; Siddharth Singh Savner et al.; arXiv; pp. 1-6 *
PVT v2: Improved Baselines with Pyramid Vision Transformer; Wang Wenhai et al.; arXiv; pp. 1-8 *
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers; Zheng Sixiao et al.; IEEE/CVF Conference on Computer Vision and Pattern Recognition; pp. 6877-6886 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115457464B (en) | Crowd counting method based on transformer and CNN | |
CN111611878B (en) | Method for crowd counting and future people flow prediction based on video image | |
CN110348376B (en) | Pedestrian real-time detection method based on neural network | |
CN112183258A (en) | Remote sensing image road segmentation method based on context information and attention mechanism | |
CN115049936A (en) | High-resolution remote sensing image-oriented boundary enhancement type semantic segmentation method | |
CN109800629A (en) | A kind of Remote Sensing Target detection method based on convolutional neural networks | |
US20220358765A1 (en) | Method for extracting oil storage tank based on high-spatial-resolution remote sensing image | |
CN111582029A (en) | Traffic sign identification method based on dense connection and attention mechanism | |
CN110399820B (en) | Visual recognition analysis method for roadside scene of highway | |
CN114638836B (en) | Urban street view segmentation method based on highly effective driving and multi-level feature fusion | |
CN115641473A (en) | Remote sensing image classification method based on CNN-self-attention mechanism hybrid architecture | |
CN114022408A (en) | Remote sensing image cloud detection method based on multi-scale convolution neural network | |
CN115311194A (en) | Automatic CT liver image segmentation method based on transformer and SE block | |
CN110097028A (en) | Crowd's accident detection method of network is generated based on three-dimensional pyramid diagram picture | |
CN114360067A (en) | Dynamic gesture recognition method based on deep learning | |
CN113313031B (en) | Deep learning-based lane line detection and vehicle transverse positioning method | |
CN109903373A (en) | A kind of high quality human face generating method based on multiple dimensioned residual error network | |
CN110599502A (en) | Skin lesion segmentation method based on deep learning | |
CN117876824B (en) | Multi-modal crowd counting model training method, system, storage medium and equipment | |
CN115841625A (en) | Remote sensing building image extraction method based on improved U-Net model | |
CN115100165B (en) | Colorectal cancer T-staging method and system based on CT image of tumor area | |
CN113139489A (en) | Crowd counting method and system based on background extraction and multi-scale fusion network | |
CN118134952A (en) | Medical image segmentation method based on feature interaction | |
CN111222453A (en) | Remote sensing image change detection method based on dense connection and geometric structure constraint | |
CN113673478B (en) | Port large-scale equipment detection and identification method based on deep learning panoramic stitching |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |