CN110414301A

CN110414301A - A method for estimating crowd density in train cars based on dual cameras

Info

Publication number: CN110414301A
Application number: CN201810408662.0A
Authority: CN
Inventors: 陈汉嵘; 谢晓华; 韦宝典
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2018-04-28
Filing date: 2018-04-28
Publication date: 2019-11-05
Anticipated expiration: 2038-04-28
Also published as: CN110414301B

Abstract

The present invention discloses a kind of based on double compartment crowd density estimation methods for taking the photograph head, it include: to propose multi-angle of view crowd density estimation network, the network consists of two parts, a part is the convolutional neural networks of parameter sharing, another part is full articulamentum, which can distinguish the crowd density grade in current train compartment.Model training stage is iterated optimization using the sample with 5 class density ratings；The model application stage, according to the regular Sampling Estimation of subway practical operation situation.The present invention is based on deep learning methods to estimate crowd density, replaces the feature of previous hand-designed, using the automatic learning characteristic of convolutional neural networks to improve the accuracy rate and robustness of crowd density estimation.

Description

A method for estimating crowd density in train cars based on dual cameras

技术领域technical field

本发明涉及人群密度估计技术领域，特别涉及一种基于双摄头的列车车厢人群密度估计方法。The invention relates to the technical field of crowd density estimation, in particular to a method for estimating crowd density in a train compartment based on dual cameras.

背景技术Background technique

已有的人群密度估计技术尚存在很多不足。基于像素的方法简单容易实现，但只能适用于人群密度较低的场景。基于纹理分析的方法虽然效果不错，但是运算繁杂，在实际应用中往往达不到实时性。而基于目标检测的方法能够在相对拥挤的情况下得到可靠的结果，但在人群重叠度高的场景中失去应用能力。There are still many deficiencies in the existing crowd density estimation techniques. Pixel-based methods are simple and easy to implement, but only suitable for scenes with low crowd density. Although the method based on texture analysis has a good effect, the calculation is complicated, and the real-time performance is often not achieved in practical applications. The method based on object detection can obtain reliable results in relatively crowded situations, but it loses its application ability in scenes with high crowd overlap.

现有的人群密度估计技术主要有以下几类：The existing crowd density estimation techniques mainly fall into the following categories:

1)基于像素统计的方法[1]。统计人群总面积和人群边缘的像素，根据得出的像素作为特征和总人数之间的线性关系进行人群密度估计。该方法通过背景减除和边缘检测技术，获得图像中的前景、背景以及边缘像素数。该方法主要应用在人群分布比较稀疏的场景。1) Methods based on pixel statistics [1]. The total area of the crowd and the pixels at the edge of the crowd are counted, and the crowd density is estimated according to the linear relationship between the obtained pixels and the total number of people. This method obtains the number of foreground, background and edge pixels in the image through background subtraction and edge detection technology. This method is mainly used in scenes with sparse crowd distribution.

2)基于纹理分析的方法[2]。通过灰度共生矩阵和小波包分解的方法提取图像纹理特征，然后用支持向量机、adaboost和神经网络作为分类模型对这些特征进行学习训练。该方法主要应用在人群分布比较密集的场景。2) Method based on texture analysis [2]. Image texture features are extracted by gray co-occurrence matrix and wavelet packet decomposition, and then these features are learned and trained using support vector machine, adaboost and neural network as classification models. This method is mainly used in scenes with dense crowd distribution.

3)基于目标检测的方法[3]。通过基于haar-like和haar小波变换的头部检测器，利用SVM分类器判别是否为头部，最后估计出整体人群的密度。3) The method based on object detection [3]. Through the head detector based on haar-like and haar wavelet transform, the SVM classifier is used to judge whether it is a head, and finally the density of the whole crowd is estimated.

发明内容SUMMARY OF THE INVENTION

本发明的主要目的是提出一种基于双摄头的列车车厢人群密度估计方法，旨在克服以上问题。The main purpose of the present invention is to propose a method for estimating crowd density in train compartments based on dual cameras, aiming at overcoming the above problems.

为实现上述目的，一种基于双摄头的列车车厢人群密度估计方法，包括如下步骤：To achieve the above purpose, a method for estimating crowd density in train compartments based on dual cameras includes the following steps:

S10准备训练样本：建立包含4个参数共享的卷积层和5个全连接层的神经网络，输入同一车厢内相同时刻的两个不同视角的视频帧，训练具有密度等级的标签的样本，其中卷积层用于提取视频的特征向量，全连接层用于将卷积层所提取出的特征向量按密度等级进行分类；S10 Prepare training samples: build a neural network with 4 parameter-sharing convolutional layers and 5 fully connected layers, input video frames from two different perspectives at the same time in the same car, and train samples with density-level labels, where The convolutional layer is used to extract the feature vector of the video, and the fully connected layer is used to classify the feature vector extracted by the convolutional layer according to the density level;

S20神经网络训练：数次迭代优化训练神经网络；S20 neural network training: several iterative optimization training neural network;

S30应用阶段：截取双摄头拍摄的当前列车车厢的视频帧分别输入至优化后的神经网络，得到当前列车车厢的图像分类结果。S30 application stage: intercept the video frames of the current train car captured by the dual cameras and input them to the optimized neural network to obtain the image classification result of the current train car.

优选地，所述S10中所述4个卷积层包括第一Conv层、第二Conv层、第三Conv层和第四Conv层，所述第一Conv层的卷积核大小为9×9，步长为1，卷积核数目是16个，图像输入第一Conv层生成16个特征图，接上线性整流函数Relu层和最大值池化Max-pooling层后，输出大小为288×464× 16的特征图。Preferably, the four convolution layers in S10 include a first Conv layer, a second Conv layer, a third Conv layer and a fourth Conv layer, and the size of the convolution kernel of the first Conv layer is 9×9 , the step size is 1, the number of convolution kernels is 16, and the image is input to the first Conv layer to generate 16 feature maps. After connecting the linear rectification function Relu layer and the maximum pooling Max-pooling layer, the output size is 288×464 × 16 feature maps.

优选地，所述第二Conv层的卷积核大小为7×7，步长为1，卷积核数目是32个，图像输入第二Conv层生成32个特征图，接上Relu层和Max-pooling 层，输出大小为144×232×32的特征图。Preferably, the size of the convolution kernel of the second Conv layer is 7×7, the stride is 1, the number of convolution kernels is 32, and the image is input into the second Conv layer to generate 32 feature maps, which are connected to the Relu layer and Max -pooling layer that outputs feature maps of size 144×232×32.

优选地，所述第三Conv层的卷积核大小为7×7，步长为1，卷积核数目是16个，图像输入第三Conv层生成16个特征图，再接上Relu层和 Max-pooling层，输出大小为72×116×16的特征图。Preferably, the size of the convolution kernel of the third Conv layer is 7×7, the stride is 1, the number of convolution kernels is 16, and the image is input into the third Conv layer to generate 16 feature maps, and then the Relu layer and The Max-pooling layer outputs feature maps of size 72×116×16.

优选地，所述第四个Conv层的卷积核大小为7×7，步长为1，卷积核数目是8个，图像输入第四Conv层生成8个特征图，接上Relu层和Max-pooling 层，输出大小为36×58×8的特征图。Preferably, the size of the convolution kernel of the fourth Conv layer is 7×7, the stride is 1, the number of convolution kernels is 8, and the image is input to the fourth Conv layer to generate 8 feature maps, which are connected to the Relu layer and The Max-pooling layer outputs feature maps of size 36×58×8.

优选地，所述5个全连接层包括FC5、FC6、FC7、FC8和Softmax层，第五个Conv层输出双摄头的两张36×58×8的特征图，分别输入到全连接层 FC5_0和FC5_1，得到两组1024维的特征向量；两组向量分别输入到FC6_0 层和FC6_1层得到两组512维的特征向量，接着两组512维的特征向量进行相加操作得到新的一组512维特征向量；该新的一组512维特征向量输入到 FC7层得到256维的特征向量；再将该256维的特征向量输入到FC8层得到 128维的特征向量；最后将该128维的特征向量输入到Softmax层得到一组5 维的概率向量。Preferably, the five fully connected layers include FC5, FC6, FC7, FC8 and Softmax layers, and the fifth Conv layer outputs two 36×58×8 feature maps of dual cameras, which are respectively input to the fully connected layer FC5_0 and FC5_1 to obtain two sets of 1024-dimensional feature vectors; the two sets of vectors are input to the FC6_0 layer and the FC6_1 layer to obtain two sets of 512-dimensional feature vectors, and then the two sets of 512-dimensional feature vectors are added to obtain a new set of 512 dimensional feature vector; the new set of 512-dimensional feature vector is input to the FC7 layer to obtain a 256-dimensional feature vector; then the 256-dimensional feature vector is input to the FC8 layer to obtain a 128-dimensional feature vector; finally the 128-dimensional feature vector The vector is input to the Softmax layer to obtain a set of 5-dimensional probability vectors.

优选地，所述密度等级包括ex-low、low、medium、high、ex-high，所述ex-low的样本标签为[1，0，0，0，0]，所述low的样本标签为[0，1，0，0，0]，所述medium的样本标签为[0，0，1，0，0]，所述high的样本标签为[0，0，0，1，0]，所述ex-high的样本标签为[0，0，0，0，1]，根据最后一层输出值大小判定图像的人群密度等级。Preferably, the density levels include ex-low, low, medium, high, and ex-high, the ex-low sample label is [1, 0, 0, 0, 0], and the low sample label is [0, 1, 0, 0, 0], the sample label of the medium is [0, 0, 1, 0, 0], the sample label of the high is [0, 0, 0, 1, 0], The sample label of the ex-high is [0, 0, 0, 0, 1], and the crowd density level of the image is determined according to the output value of the last layer.

优选地，所述20具体包括：Preferably, the 20 specifically includes:

S201设置神经网络的每个batch为一预定值，每次迭代输入该预定值数的样本；S201 sets each batch of the neural network as a predetermined value, and each iteration inputs the samples of the predetermined value;

S202采用gaussian初始化神经网络卷积层的参数，采用xavier方法初始化全连接层参数，所述gaussian″是高斯分布初始化为 W～N(μ，σ²)，其中μ＝0，σ²＝0.01，所述xavier是将参数以均匀分布的方式初始化，具体为其中n表示参数所在层的输入维度，m则表示输出维度；S202 uses the gaussian to initialize the parameters of the convolutional layer of the neural network, and uses the ^xavier method to ^initialize the parameters of the fully connected layer. The xavier is to initialize the parameters in a uniformly distributed manner, specifically where n represents the input dimension of the layer where the parameter is located, and m represents the output dimension;

S203使用Adam算法进行优化训练，其中Adam算法的公式如下：S203 uses the Adam algorithm for optimization training, where the formula of the Adam algorithm is as follows:

参数更新规则：Parameter update rules:

上式中，表示第t次迭代的梯度，β₁，β₂是超参数，一般设置为 0.9和0.999，∈是一个很小的数值以防分母为零，一般设置成10^-8，m_t近似看做是对的期望，v_t近似看做对的期望，而和则分别是对m_t和v_t的无偏差估计；In the above formula, Indicates the gradient of the t-th iteration, β ₁ , β ₂ are hyperparameters, generally set to 0.9 and 0.999, ∈ is a small value in case the denominator is zero, generally set to 10 ^-8 , m _t is approximately regarded as right the expectation of , v _t is approximately regarded as right expectations, while and are unbiased estimates of m _t and v _t , respectively;

S204利用损失函数Softmax迭代神经网络数次，直至达到最优，所述损失函数Softmax的公式如下：S204 uses the loss function Softmax to iterate the neural network several times until it reaches the optimum. The formula of the loss function Softmax is as follows:

其中，左项是交叉熵代价函数，[f₁，f₂，…，f_K]为网络的输出向量，在本任务中K＝5，y_i表示为此次迭代中第i个样本所对应的等级密度，右项 R(W)是正则项，W表示网络参数，λ是超参数，设置为0.0002。Among them, the left term is the cross-entropy cost function, [f ₁ , f ₂ ,..., f _K ] is the output vector of the network, in this task K=5, y _i represents the ith sample in this iteration corresponding to The rank density of , the right term R(W) is the regular term, W represents the network parameter, λ is a hyperparameter and is set to 0.0002.

优选地，所述30中将两视频帧进行加权融合，得到当前列车车厢的图像分类结果的计算公式如下：Preferably, the weighted fusion of the two video frames is performed in the step 30, and the calculation formula for obtaining the image classification result of the current train car is as follows:

class＝argmax{[F(X₁；θ)+F(X₂；θ)]/2}class=argmax{[F(X ₁ ; θ)+F(X ₂ ; θ)]/2}

其中F(X_i；θ)为网络模型的输出，X₁，X₂分别为两个摄像头输入的图像，θ为收敛模型的参数。Among them, F(X _i ; θ) is the output of the network model, X ₁ and X ₂ are the images input by the two cameras respectively, and θ is the parameter of the convergent model.

本发明为一种基于深度学习的多视角人群的人群密度估计方法，采用卷积神经网络自动学习特征来取代以往手工设计的特征，得到更加鲁棒的模型，并针对地铁车厢特殊环境提出了双端摄像头输入，从而能极端情况下处理严重遮挡的问题，提高人群密度估计的准确率。The present invention is a multi-view crowd density estimation method based on deep learning, adopts the convolutional neural network to automatically learn features to replace the previous hand-designed features, and obtains a more robust model, and proposes a dual-level model for the special environment of subway cars. The input of the end camera can be used to deal with the problem of severe occlusion in extreme cases and improve the accuracy of crowd density estimation.

附图说明Description of drawings

为了更清楚地说明本发明实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图示出的结构获得其他的附图。In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the following briefly introduces the accompanying drawings that need to be used in the description of the embodiments or the prior art. Obviously, the accompanying drawings in the following description are only These are some embodiments of the present invention, and for those of ordinary skill in the art, other drawings can also be obtained according to the structures shown in these drawings without creative efforts.

图1为本发明基于双摄头的列车车厢人群密度估计方法的方法流程图；Fig. 1 is a method flow chart of a method for estimating crowd density in train cars based on dual cameras of the present invention;

本发明目的的实现、功能特点及优点将结合实施例，参照附图做进一步说明。The realization, functional characteristics and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

具体实施方式Detailed ways

下面将结合本发明实施例中的附图，对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本发明的一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

需要说明，若本发明实施例中有涉及方向性指示(诸如上、下、左、右、前、后……)，则该方向性指示仅用于解释在某一特定姿态(如附图所示)下各部件之间的相对位置关系、运动情况等，如果该特定姿态发生改变时，则该方向性指示也相应地随之改变。It should be noted that if there are directional indications (such as up, down, left, right, front, back, etc.) involved in the embodiments of the present invention, the directional indications are only used to explain a certain posture (as shown in the accompanying drawings). If the specific posture changes, the directional indication also changes accordingly.

另外，若本发明实施例中有涉及“第一”、“第二”等的描述，则该“第一”、“第二”等的描述仅用于描述目的，而不能理解为指示或暗示其相对重要性或者隐含指明所指示的技术特征的数量。由此，限定有“第一”、“第二”的特征可以明示或者隐含地包括至少一个该特征。另外，各个实施例之间的技术方案可以相互结合，但是必须是以本领域普通技术人员能够实现为基础，当技术方案的结合出现相互矛盾或无法实现时应当认为这种技术方案的结合不存在，也不在本发明要求的保护范围之内。In addition, if there are descriptions involving "first", "second", etc. in the embodiments of the present invention, the descriptions of "first", "second", etc. are only used for the purpose of description, and should not be construed as indicating or implying Its relative importance or implicitly indicates the number of technical features indicated. Thus, a feature delimited with "first", "second" may expressly or implicitly include at least one of that feature. In addition, the technical solutions between the various embodiments can be combined with each other, but must be based on the realization by those of ordinary skill in the art. When the combination of technical solutions is contradictory or cannot be realized, it should be considered that the combination of such technical solutions does not exist. , is not within the protection scope of the present invention.

如图1所示，图1本发明基于双摄头的列车车厢人群密度估计方法的方法流程图，(a)为训练阶段，(b)为应用阶段，训练阶段中使用反向传播(Back Propagation，简称BP)算法迭代优化模型参数；测试阶段采用概率向量融合的方法提高分类准确率。As shown in Figure 1, Figure 1 is a flow chart of the method for estimating crowd density in train cars based on dual cameras of the present invention, (a) is the training stage, (b) is the application stage, and Back Propagation , referred to as BP) algorithm iteratively optimizes the model parameters; in the test phase, the method of probability vector fusion is used to improve the classification accuracy.

一种基于双摄头的列车车厢人群密度估计方法，包括如下步骤：A method for estimating crowd density in train compartments based on dual cameras, comprising the following steps:

优选地，所述S10中所述5个卷积层包括第一Conv层、第二Conv层、第三Conv层、第四Conv层和第五Conv层，所述第一Conv层的卷积核大小为9×9，步长为1，卷积核数目是16个，图像输入第一Conv层生成16个特征图，接上线性整流函数Relu层和最大值池化Max-pooling层后，输出大小为288×464×16的特征图。Preferably, the five convolution layers in S10 include a first Conv layer, a second Conv layer, a third Conv layer, a fourth Conv layer and a fifth Conv layer, and the convolution kernel of the first Conv layer The size is 9 × 9, the stride is 1, the number of convolution kernels is 16, the image is input to the first Conv layer to generate 16 feature maps, and after connecting the linear rectification function Relu layer and the maximum pooling Max-pooling layer, the output Feature maps of size 288×464×16.

优选地，所述20具体包括：Preferably, the 20 specifically includes:

参数更新规则：Parameter update rules:

上式中，表示第t次迭代的梯度，β₁，β₂是超参数，一般设置为 0.9和0.999，∈是一个很小的数值以防分母为零，一般设置成10，mt近似看做是对的期望，v_t近似看做对的期望，而和则分别是对m_t和v_t的无偏差估计；In the above formula, Indicates the gradient of the t-th iteration, β ₁ , β ₂ are hyperparameters, generally set to 0.9 and 0.999, ∈ is a small value in case the denominator is zero, generally set to 10, mt is approximately regarded as correct the expectation of , v _t is approximately regarded as right expectations, while and are unbiased estimates of m _t and v _t , respectively;

在本发明实施例中，本发明提出一个包含5个卷积层的分类神经网络。将地铁人群图片及其分类标签输入网络，利用损失函数Softmax不断迭代训练优化网络参数；本发明还提出多摄像头输入来解决遮挡问题，其最终预测结果为各个输入结果的融合，提高分类准确率。In the embodiment of the present invention, the present invention proposes a classification neural network including 5 convolution layers. The subway crowd pictures and their classification labels are input into the network, and the loss function Softmax is used to continuously train and optimize the network parameters; the invention also proposes multi-camera input to solve the occlusion problem, and the final prediction result is the fusion of each input result, which improves the classification accuracy.

相对于以前的人群密度估计技术，本发明具有以下的优点：Compared with the previous crowd density estimation technology, the present invention has the following advantages:

1、能够在地铁车厢从稀疏到极端拥挤的环境下达到满意的效果，有更好的鲁棒性；1. It can achieve satisfactory results in the environment from sparse to extremely crowded subway cars, and has better robustness;

2、端到端完成训练，与传统方法相比，没有繁杂的运算过程，并且能在实际应用中达到实时性。2. The training is completed end-to-end. Compared with the traditional method, there is no complicated operation process, and it can achieve real-time performance in practical applications.

本发明的基础在于能够在地铁车厢等密集场所中准确估计出人群密度等级，并且具有鲁棒性，实时性等效果。因此，任何基于本发明提出的人群密度等级分类应用技术都包含在本发明之内，如视频智能监控等。The basis of the invention is that it can accurately estimate the crowd density level in dense places such as subway cars, and has the effects of robustness and real-time performance. Therefore, any application technology of crowd density classification based on the present invention is included in the present invention, such as video intelligent monitoring and the like.

Claims

1. a method for estimating crowd density in train compartments based on dual cameras, is characterized in that, comprises the steps:

S10 Prepare training samples: build a neural network with 4 parameter-sharing convolutional layers and 5 fully connected layers, input video frames from two different perspectives at the same time in the same car, and train samples with density-level labels, where The convolutional layer is used to extract the feature vector of the video, and the fully connected layer is used to classify the feature vector extracted by the convolutional layer according to the density level;

S20 neural network training: several iterative optimization training neural network;

S30 application stage: intercept the video frames of the current train car captured by the dual cameras and input them to the optimized neural network to obtain the image classification result of the current train car.

2 . The method for estimating crowd density in train compartments based on dual cameras according to claim 1 , wherein the four convolution layers in S10 include a first Conv layer, a second Conv layer, and a third Conv layer. 3 . layer and the fourth Conv layer, the size of the convolution kernel of the first Conv layer is 9 × 9, the stride is 1, the number of convolution kernels is 16, and the image is input to the first Conv layer to generate 16 feature maps, which are connected to the line. After the rectification function Relu layer and the maximum pooling Max-pooling layer, the output size is 288 × 464 × 16 feature map.

3 . The method for estimating crowd density in train cars based on dual cameras according to claim 1 , wherein the convolution kernel size of the second Conv layer is 7×7, the step size is 1, and the number of convolution kernels is 1. 4 . It is 32, the image is input to the second Conv layer to generate 32 feature maps, connected to the Relu layer and the Max-pooling layer, and the output size is 144 × 232 × 32 feature maps.

4 . The method for estimating crowd density in train cars based on dual cameras according to claim 1 , wherein the convolution kernel size of the third Conv layer is 7×7, the step size is 1, and the number of convolution kernels is 1. 5 . It is 16, the image is input to the third Conv layer to generate 16 feature maps, and then connected to the Relu layer and the Max-pooling layer, and the output size is 72 × 116 × 16 feature maps.

5 . The method for estimating crowd density in train cars based on dual cameras according to claim 1 , wherein the convolution kernel size of the fourth Conv layer is 7×7, the step size is 1, and the convolution kernel is 1. 6 . The number is 8, and the image is input to the fourth Conv layer to generate 8 feature maps, which are connected to the Relu layer and the Max-pooling layer, and the output size is 36 × 58 × 8 feature maps.

6. The method for estimating crowd density in train compartments based on dual cameras according to claim 1, wherein the five fully connected layers include FC5, FC6, FC7, FC8 and Softmax layers, and the fifth Conv layer outputs The two 36×58×8 feature maps of the dual camera are input to the fully connected layers FC5_0 and FC5_1 respectively, and two sets of 1024-dimensional feature vectors are obtained; the two sets of vectors are input to the FC6_0 layer and the FC6_1 layer respectively to obtain two sets of 512-dimensional feature vectors. Then two sets of 512-dimensional feature vectors are added to obtain a new set of 512-dimensional feature vectors; the new set of 512-dimensional feature vectors is input to the FC7 layer to obtain a 256-dimensional feature vector; then the 256-dimensional feature vector is obtained. The dimensional feature vector is input to the FC8 layer to obtain a 128-dimensional feature vector; finally, the 128-dimensional feature vector is input to the Softmax layer to obtain a set of 5-dimensional probability vectors.

7 . The method for estimating crowd density in train compartments based on dual cameras according to claim 1 , wherein the density levels include ex-low, low, medium, high, and ex-high, and the ex-low The sample label is [1,0,0,0,0], the low sample label is [0,1,0,0,0], the medium sample label is [0,0,1,0, 0], the sample label of the high is [0,0,0,1,0], the sample label of the ex-high is [0,0,0,0,1], according to the size of the output value of the last layer Determine the crowd density level of the image.

8. The method for estimating crowd density in train compartments based on dual cameras according to claim 1, wherein the method 20 specifically comprises:

S201 sets each batch of the neural network as a predetermined value, and each iteration inputs the samples of the predetermined value;

S202 uses gaussian to initialize the parameters of the convolutional layer of the neural network, and uses the xavier method to initialize the parameters of the fully connected layer. The gaussian" is a Gaussian distribution initialized as W~N(μ,σ ² ), where μ=0,σ ² =0.01, The xavier is to initialize the parameters in a uniformly distributed manner, specifically where n represents the input dimension of the layer where the parameter is located, and m represents the output dimension;

S203 uses the Adam algorithm for optimization training, where the formula of the Adam algorithm is as follows:

Parameter update rules:

In the above formula, Indicates the gradient of the t-th iteration, β ₁ , β ₂ are hyperparameters, generally set to 0.9 and 0.999, ∈ is a small value in case the denominator is zero, generally set to 10 ^-8 , m _t is approximately regarded as right the expectation of , v _t is approximately regarded as right expectations, while and are unbiased estimates of m _t and v _t , respectively;

S204 uses the loss function Softmax to iterate the neural network several times until it reaches the optimum. The formula of the loss function Softmax is as follows:

Among them, the left term is the cross-entropy cost function, [f ₁ , f ₂ ,...,f _K ] is the output vector of the network, in this task K=5, y _i represents the ith sample in this iteration corresponding to The rank density of , the right term R(W) is the regular term, W represents the network parameter, λ is a hyperparameter and is set to 0.0002.

9. The method for estimating crowd density in train carriages based on dual cameras as claimed in claim 1, wherein the weighted fusion of two video frames is carried out in the described 30, and the calculation formula for obtaining the image classification result of the current train carriage is as follows:

class=argmax{[F(X ₁ ; θ)tF(X ₂ ; θ)]/2}

Among them, F(X _i ; θ) is the output of the network model, X ₁ and X ₂ are the images input by the two cameras respectively, and θ is the parameter of the convergent model.