CN107862261A

CN107862261A - Image crowd counting method based on a multi-scale convolutional neural network

Info

Publication number: CN107862261A
Application number: CN201711014291.XA
Authority: CN
Inventors: 周圆; 杨建兴; 李成浩; 杜晓婷; 毛爱玲
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2017-10-25
Filing date: 2017-10-25
Publication date: 2018-03-30

Abstract

The invention discloses an image crowd counting method based on a multi-scale convolutional neural network. Step (1) generates continuous density map labels by converting annotated images into continuous estimated density maps. Step (2) uses the multi-scale convolutional neural network to obtain an accurate predicted crowd density map: after initial parameters are set for the network, the loss L(θ) of the input images is computed against the actual density maps, and the parameters of the entire network are updated in each optimization iteration until the loss converges to a small value. Compared with the prior art, the invention addresses the large scale variation of people within a single image. Building on a single-column convolutional neural network, it fuses features from different network levels before generating the predicted density map, so that features extracted at different depths correspond to different scales. This greatly improves the accuracy of the predicted density map and addresses scale variation and occlusion in crowd images.

Description

Image crowd counting method based on multi-scale convolutional neural network

Technical field

The invention relates to the technical field of crowd image analysis, and in particular to a crowd counting algorithm based on a multi-scale convolutional neural network.

Background

Crowd counting is an intelligent surveillance application that computes the number of people by predicting a density map for a crowd image. With the exponential growth of the world population, rapid urbanization has given rise to many large-scale events such as sports competitions and public parades, and problems such as traffic congestion lead to large crowd gatherings. Crowd behavior analysis algorithms are therefore of great significance for better crowd management and personal safety.

As deep learning has become widespread, crowd counting algorithms based on convolutional neural networks have greatly improved detection accuracy over traditional algorithms. These algorithms fall into two main types: regression-based and density-map-based. The former uses crowd images with the corresponding head counts as labels and trains the network to learn a nonlinear mapping from a crowd image to a number of people, so the network outputs the count directly. The latter uses crowd images with the corresponding density maps as labels and trains the network to generate a density map for the input image. Unlike the regression approach, the network outputs a density map, and the number of people is then computed from the predicted density map. However, because crowd images are mostly captured by surveillance cameras or from high vantage points, the shooting angle varies widely, and so do the size and scale of the people in the captured images. The multi-column convolutional neural network proposed by Zhang et al. has high network complexity and a large number of parameters. Its three columns must be pre-trained before their output features are fused, so it cannot capture the multi-scale information of a single image simultaneously.

Summary of the invention

The purpose of the invention is to extract features at different depths with a convolutional neural network and to fuse the features of different scales. To this end, a crowd density detection method based on a multi-scale convolutional neural network is proposed, which computes the total number of people by predicting a density map from a crowd image.

An image crowd counting method based on a multi-scale convolutional neural network according to the invention comprises the following steps:

Step 1. Generate continuous density map labels, specifically as follows:

A density map is generated from the manually annotated head coordinates. An image with N annotated heads is represented by the following function:

$$H(x) = \sum_{i=1}^{N} \delta(x - x_i)$$

where δ(x − x_i) is a delta function and x_i is the position of an annotated head point.

The annotated image is converted into a continuous density map as follows:

$$F(x) = H(x) * G_\sigma(x)$$

where * denotes convolution and G_σ(x) is a Gaussian kernel with standard deviation σ.

Step 2. Use the multi-scale convolutional neural network to obtain an accurate predicted crowd density map, specifically as follows:

The multi-scale convolutional neural network obtains three convolutional layers through a convolution-pooling-convolution-pooling chain. Features with different receptive fields are extracted from these first three convolutional layers, fused by concatenation, and passed through two further convolutional layers to output the corresponding density map.

The loss function L(θ) of the multi-scale convolutional neural network is computed as follows:

$$L(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| F(x_i; \theta) - M(x_i) \right\|_2^2$$

where N is the number of images fed to the convolutional neural network, x_i is the i-th input image, and M(x_i) is the ground-truth density map matrix of the i-th input image.

After initial parameters are set for the convolutional neural network, the loss L(θ) of the input images is computed against the actual density maps, and the parameters of the entire network are updated in each optimization iteration until the loss converges to a small value.

Compared with the prior art, the image crowd counting method based on a multi-scale convolutional neural network according to the invention has the following effects:

1. With a single-column convolutional neural network and a small number of parameters, it combines features from different depths to detect pedestrians of different scales in crowd images;

2. It addresses the large scale variation of people within a single image. Building on a single-branch convolutional neural network, it fuses features from different network levels before generating the predicted density map, so that features extracted at different depths correspond to different scales, which greatly improves the accuracy of the predicted density map;

3. It addresses problems such as scale variation and occlusion in crowd images.

Description of drawings

Figure 1 is a schematic diagram of the overall flow of the image crowd counting method based on a multi-scale convolutional neural network according to the invention;

Figure 2 is a structural diagram of the multi-scale convolutional neural network;

Figure 3 shows the experimental results: (a) is the crowd image and (b) is the corresponding density map.

Detailed description

Embodiments of the invention are described in further detail below with reference to the accompanying drawings.

As shown in Figure 1, the crowd density detection method based on a multi-scale convolutional neural network according to the invention fuses the features of a single-column convolutional neural network at different depths. The specific steps are as follows:

Step 1. Generate continuous density map labels by converting the annotated images into continuous estimated density maps, specifically as follows:

The estimated density map F(x) is expressed as:

$$F(x) = H(x) * G_\sigma(x)$$

where H(x) is the sum of delta functions located at the annotated head positions, * denotes convolution, and G_σ(x) is a Gaussian kernel with standard deviation σ;

Step 2. Use the multi-scale convolutional neural network to obtain an accurate predicted crowd density map. The network obtains three convolutional layers through a convolution-pooling-convolution-pooling chain, and features with different receptive fields are extracted from these first three convolutional layers. Together they form a multi-level feature set drawn from three convolutional layers of different depths. As the network deepens, higher convolutional layers have larger receptive fields: features from the low-level layers retain more detail about small objects, while the high-level layers yield high-level semantic features. These features are fused by concatenation, that is, by stacking the feature maps, and then passed through two convolutional layers to output the corresponding density map. The loss function of the network is the Euclidean distance L(θ) between the estimated density map F(x_i; θ) and the actual density map M(x_i):

$$L(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| F(x_i; \theta) - M(x_i) \right\|_2^2$$

where N is the number of images fed to the convolutional neural network, x_i is the i-th input image, and M(x_i) is the ground-truth density map matrix of the i-th input image;
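
The Euclidean loss, one half of the mean over N images of the squared L2 distance between estimated and actual density maps, can be evaluated as in the following minimal Python sketch. The function name `euclidean_loss` and the tiny 2x2 density maps are illustrative assumptions, not values from the disclosure.

```python
def euclidean_loss(predicted, target):
    """Return 1/(2N) * sum_i ||F(x_i; theta) - M(x_i)||_2^2 over N density maps.

    Each density map is a list of rows; `predicted` and `target` are lists
    holding N such maps of identical size.
    """
    n = len(predicted)
    total = 0.0
    for f_map, m_map in zip(predicted, target):
        for f_row, m_row in zip(f_map, m_map):
            for f, m in zip(f_row, m_row):
                total += (f - m) ** 2  # squared per-pixel difference
    return total / (2.0 * n)


# Hypothetical 2x2 maps for N = 2 images (illustrative values only).
predicted = [[[0.5, 0.0], [0.0, 0.5]], [[1.0, 0.0], [0.0, 0.0]]]
target = [[[1.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 0.0]]]
loss = euclidean_loss(predicted, target)
```

In this toy case only the first map differs from its target, by 0.5 at two pixels, so the summed squared error is 0.5 and the loss is 0.5 / (2 x 2) = 0.125.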

After initial parameters are set for the convolutional neural network, the loss L(θ) of the input images is computed against the actual ground-truth density maps, and the parameters of the entire network are updated in each optimization iteration until the loss converges to a small value.

Because of the camera's shooting angle, crowd images usually exhibit some degree of perspective distortion: pedestrians close to the camera occupy a larger area of the image, while pedestrians far from the camera occupy a smaller area. In this step, a multi-scale convolutional neural network is used to detect pedestrians of different scales in the crowd image. In a convolutional neural network, features at different depths represent different levels of abstraction. The lower layers extract contour and shape features and have relatively small receptive fields. As the network deepens, the deeper layers extract high-level semantic features of the image. Stacking and fusing the features from different levels combines the multi-scale features of the crowd image well and ultimately yields a more accurate predicted crowd density map.
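
The growth of the receptive field with depth can be made concrete with a short calculation. The sketch below applies the standard recurrence (the receptive field grows by (kernel − 1) times the accumulated stride at each layer) to a convolution-pooling-convolution-pooling-convolution chain. The 3x3 convolution kernels and 2x2 stride-2 pooling are assumptions chosen for illustration; the disclosure does not state the kernel sizes.

```python
def receptive_fields(layers):
    """Compute the receptive field after each layer.

    `layers` is a list of (name, kernel_size, stride) tuples applied in order.
    Returns a list of (name, receptive_field_in_input_pixels).
    """
    rf, jump = 1, 1  # receptive field and accumulated stride at the input
    result = []
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        result.append((name, rf))
    return result


# Assumed layer sizes (hypothetical, not taken from the patent text).
layers = [
    ("conv1", 3, 1),
    ("pool1", 2, 2),
    ("conv2", 3, 1),
    ("pool2", 2, 2),
    ("conv3", 3, 1),
]
fields = dict(receptive_fields(layers))
```

Under these assumed sizes the three convolutional layers see 3, 8 and 18 input pixels respectively, which is why fusing them gives the network access to small, medium and large heads at once.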

A specific embodiment is described as follows:

The problem to be solved by the invention is: given a crowd image or a single video frame, estimate the crowd density in each region of the image and the total number of people.

The known input image is represented as an m × n matrix, x ∈ R^{m×n}. The actual crowd density corresponding to the input image x is then expressed as:

$$M(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_\sigma(x)$$

where N is the number of people in the image, x is the position of each pixel in the image, x_i is the position of the i-th head in the image, δ(x − x_i) is the impulse (delta) function, * denotes convolution, and G_σ(x) is a Gaussian kernel with standard deviation σ.
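
A minimal Python sketch of this label generation follows. Convolving a sum of delta functions with a Gaussian kernel is equivalent to placing one normalised Gaussian at each annotated head position, which is what the code does. The image size, head coordinates, the value sigma = 2.0 and the truncation of the kernel at three standard deviations are all illustrative assumptions.

```python
import math


def gaussian_kernel(sigma, radius):
    """Return a (2*radius+1) x (2*radius+1) Gaussian kernel that sums to 1."""
    size = 2 * radius + 1
    kernel = [
        [
            math.exp(-((r - radius) ** 2 + (c - radius) ** 2) / (2.0 * sigma ** 2))
            for c in range(size)
        ]
        for r in range(size)
    ]
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]


def density_map(height, width, heads, sigma=2.0):
    """Build a density map by adding one normalised Gaussian per head (row, col)."""
    radius = int(3 * sigma)  # assumed truncation radius
    kernel = gaussian_kernel(sigma, radius)
    dmap = [[0.0] * width for _ in range(height)]
    for head_row, head_col in heads:
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                r, c = head_row + dr, head_col + dc
                if 0 <= r < height and 0 <= c < width:
                    dmap[r][c] += kernel[dr + radius][dc + radius]
    return dmap


# Hypothetical annotations: three heads in a 64 x 64 image.
heads = [(20, 20), (30, 45), (40, 25)]
dmap = density_map(64, 64, heads, sigma=2.0)
count = sum(sum(row) for row in dmap)  # integrates to the number of heads
```

Because each kernel sums to one, the density map integrates to the head count (3 here) as long as every kernel lies fully inside the image. Heads closer to the border than the kernel radius lose part of their mass in this sketch.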

The goal of this embodiment is to learn a mapping function from the input image x to the crowd density map:

$$F: x \to F(x) \approx M(x)$$

where F(x) is the estimated crowd density map.

To learn F, the parameters θ are optimized so that the estimated crowd density map F(x; θ) approaches the actual density map M(x), where θ denotes the parameters to be learned. In general, F is a complex nonlinear function.

Figure 2 shows the multi-scale convolutional neural network used by the invention to learn the nonlinear function F from a crowd image to a density map. The network fuses features from different depth levels. The first-layer feature map of the single-column convolutional neural network passes through one convolution and two pooling operations, and the second-layer feature map passes through one convolution and one pooling operation. The features obtained from these first two layers are concatenated along the channel dimension with the feature map produced by the third convolutional layer, forming the merged feature maps, which then pass through two convolutional layers to produce the final density map.
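
The resolution alignment and channel-wise concatenation described above can be sketched in plain Python. The sketch covers only the pooling and concatenation: the extra convolution on each branch is omitted, and the channel counts (2, 3 and 4), the 16 x 16 input resolution and the constant feature values are hypothetical choices made purely to show the shapes.

```python
def max_pool_2x2(fmap):
    """Apply 2x2 max pooling with stride 2 to a [channels][rows][cols] feature map."""
    pooled = []
    for channel in fmap:
        rows, cols = len(channel), len(channel[0])
        pooled.append(
            [
                [
                    max(
                        channel[r][c],
                        channel[r][c + 1],
                        channel[r + 1][c],
                        channel[r + 1][c + 1],
                    )
                    for c in range(0, cols - 1, 2)
                ]
                for r in range(0, rows - 1, 2)
            ]
        )
    return pooled


def constant_map(channels, rows, cols, value):
    """Create a placeholder feature map filled with a single value."""
    return [[[value] * cols for _ in range(rows)] for _ in range(channels)]


def shape(fmap):
    """Return (channels, rows, cols) of a feature map."""
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


# Placeholder outputs of the three convolutional layers (assumed sizes).
conv1 = constant_map(2, 16, 16, 0.1)  # full resolution
conv2 = constant_map(3, 8, 8, 0.2)    # after one pooling
conv3 = constant_map(4, 4, 4, 0.3)    # after two poolings

branch1 = max_pool_2x2(max_pool_2x2(conv1))  # pooled twice: 16x16 -> 4x4
branch2 = max_pool_2x2(conv2)                # pooled once:   8x8 -> 4x4
merged = branch1 + branch2 + conv3           # concatenate along the channel axis
```

After pooling, all three branches share the 4 x 4 spatial size, so list concatenation along the first axis yields a merged feature map with 2 + 3 + 4 = 9 channels, which would then feed the two final convolutional layers.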

The loss function of the above multi-scale convolutional neural network is the Euclidean distance between the estimated density map and the actual density map:

$$L(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| F(x_i; \theta) - M(x_i) \right\|_2^2$$

During training, gradient descent is used to update the parameters θ of the entire network in each optimization iteration until the loss L(θ) converges to a small value.
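
The update rule can be illustrated with a deliberately tiny stand-in for the network. In the sketch below the "network" is a single scalar parameter theta that scales each input pixel, F(x; theta) = theta * x, trained by plain gradient descent on the Euclidean loss. This toy model, the data, the learning rate of 0.1 and the stopping threshold are assumptions for illustration and do not reproduce the patented network.

```python
def loss_and_grad(theta, inputs, targets):
    """Return L(theta) = 1/(2N) * sum ||theta*x_i - m_i||^2 and dL/dtheta."""
    n = len(inputs)
    loss, grad = 0.0, 0.0
    for x, m in zip(inputs, targets):
        for x_p, m_p in zip(x, m):
            residual = theta * x_p - m_p
            loss += residual ** 2
            grad += residual * x_p
    return loss / (2.0 * n), grad / n


# Hypothetical flattened inputs; targets are generated with theta = 0.5.
inputs = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 1.0, 2.0]]
targets = [[0.5 * value for value in x] for x in inputs]

theta = 0.0           # initial parameter
learning_rate = 0.1   # assumed step size
history = []
for step in range(1000):
    loss, grad = loss_and_grad(theta, inputs, targets)
    history.append(loss)
    if loss < 1e-12:  # loss has converged to a small value
        break
    theta -= learning_rate * grad  # gradient descent update
```

The loss shrinks at every iteration and theta approaches 0.5, the value used to generate the targets. A real implementation would compute the gradient for every weight of the network by backpropagation.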

The invention was compared with other methods on three public datasets: the shopping-mall dataset MALL, UCSD, and SHANGHAITECH. The accuracy of the algorithm is measured with the following evaluation criteria.

Mean absolute error (MAE):

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| z_i - \hat{z}_i \right|$$

and mean squared error (MSE):

$$\mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( z_i - \hat{z}_i \right)^2}$$

where N is the number of images, z_i is the actual head count in the i-th image, and \hat{z}_i is the head count output for the i-th image by the network provided by the invention.

On the MALL dataset, the comparison between the invention and existing algorithms is shown in Table 1 (where MD-CNN is the algorithm of the invention):

Table 1

On the UCSD dataset, the comparison between the invention and the prior art is shown in Table 2:

Table 2

Method                             MAE     MSE
Kernel ridge regression            2.16    7.45
Ridge regression                   2.25    7.82
Gaussian process regression        2.24    7.97
Cumulative attribute regression    2.07    6.86
Zhang et al.                       1.60    3.31
MCNN                               1.07    1.35
MDCNN (ours)                       1.16    1.75

The comparison with other existing algorithms on the SHANGHAITECH part_B dataset is shown in Table 3:

Table 3

Method          MAE     MSE
LBP+RR          59.1    87.1
Zhang et al.    32      49.8
MCNN            26.4    41.3
MDCNN (ours)    22.3    39.45
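
The two evaluation metrics reported in the tables above can be computed from per-image counts as in the following sketch. It assumes that MSE denotes the square root of the mean squared count error, the only reading consistent with the reported values (a plain mean of squares could never be smaller than the square of the MAE). The counts are hypothetical and are not results from the patent.

```python
import math


def mean_absolute_error(actual, predicted):
    """MAE: average absolute difference between actual and predicted counts."""
    return sum(abs(z - z_hat) for z, z_hat in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """MSE in the root form assumed here: sqrt of the mean squared difference."""
    squared = sum((z - z_hat) ** 2 for z, z_hat in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical head counts for four images (illustrative only).
actual = [10, 20, 30, 40]
predicted = [12.0, 18.0, 33.0, 39.0]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
```

For these counts the absolute errors are 2, 2, 3 and 1, so the MAE is 2.0 and the MSE is the square root of 18 / 4 = 4.5.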

Claims

1. An image crowd counting method based on a multi-scale convolutional neural network, characterized in that the method comprises the following steps:

Step (1): generating continuous density map labels by converting the annotated images into continuous estimated density maps, specifically comprising the following processing:

generating the corresponding density map from the manually annotated head coordinates, wherein an image with N annotated heads is represented by the following function:

$$H(x) = \sum_{i=1}^{N} \delta(x - x_i)$$

where δ(x − x_i) is a delta function and x_i denotes the position of an annotated head point;

the estimated density map F(x) being expressed as follows:

$$F(x) = H(x) * G_\sigma(x)$$

where * denotes convolution and G_σ(x) is a Gaussian kernel with standard deviation σ;

Step (2): obtaining an accurate predicted crowd density map using the multi-scale convolutional neural network, specifically comprising the following processing:

the multi-scale convolutional neural network obtains three convolutional layers through a convolution-pooling-convolution-pooling chain, features with different receptive fields are extracted from the first three convolutional layers, the features are fused by concatenation, and the corresponding density map is output through two further convolutional layers;

the loss function L(θ) of the multi-scale convolutional neural network is computed as follows:

$$L(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| F(x_i; \theta) - M(x_i) \right\|_2^2$$

where N is the number of images input to the convolutional neural network, x_i is the i-th input image of the convolutional neural network, and M(x_i) denotes the ground-truth density map matrix of the i-th input image;

after initial parameters are set for the convolutional neural network, the loss L(θ) of the input images is computed against the actual ground-truth density maps, and the parameters of the entire network are then updated in each optimization iteration until the loss converges to a small value.