CN112633106A - Crowd characteristic recognition network construction and training method suitable for large depth of field - Google Patents

Crowd characteristic recognition network construction and training method suitable for large depth of field

Info

Publication number
CN112633106A
CN112633106A
Authority
CN
China
Prior art keywords
network
convolution
crowd
density
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011484694.2A
Other languages
Chinese (zh)
Inventor
田青
唐绍鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Jiuhe Intelligent Technology Co ltd
Original Assignee
Suzhou Jiuhe Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Jiuhe Intelligent Technology Co ltd filed Critical Suzhou Jiuhe Intelligent Technology Co ltd
Priority to CN202011484694.2A priority Critical patent/CN112633106A/en
Publication of CN112633106A publication Critical patent/CN112633106A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A crowd characteristic recognition network construction and training method suitable for large depth of field comprises the following: the front-end network is a VGG-16 network with the fully connected layers removed, using 3×3 convolution kernels; in the front-end VGG-16 network, max pooling is applied three times, which reduces the resolution of the feature map; the back-end network is a three-branch network using dilated convolutions with dilation rates of 2 and 3. The network training steps comprise: 1) generating density maps, 2) the loss function, 3) evaluation criteria. The scheme adds a prior-information loss: during network training, the L2 distance is used as the loss function. Because the traditional L2 distance alone overestimates the crowd in low-density areas and underestimates it in high-density areas, the loss function is computed blockwise, which greatly reduces the errors these problems cause and effectively improves counting accuracy.

Description

Crowd characteristic recognition network construction and training method suitable for large depth of field
Technical Field
The invention relates to the field of crowd counting in computer vision, and in particular to a method for constructing and training a convolutional-neural-network model that recognizes crowd characteristics under a large depth of field.
Background
The main task of crowd counting is to recognize crowd characteristics in an image and accurately calculate the number of people it contains. Early crowd-counting methods fell into two classes: detection-based and regression-based. In detection-based approaches, a sliding-window detector locates people in the scene and the detections are counted. These approaches divide into two broad categories: whole-body detection and part-based detection. Whole-body detection, the typical traditional approach, trains a classifier to detect pedestrians using features such as wavelets, HOG, and edges extracted from the whole body; the learning algorithms are mainly SVM, boosting, and random forests. Whole-body detection is mainly suited to sparse crowds, but as crowd density grows, occlusion between people becomes increasingly severe. Part-based detection was therefore introduced for the counting problem: it counts people by detecting partial body structures such as the head and shoulders, and performs slightly better than whole-body detection.
Regression-based methods learn a mapping from image features to a crowd count in two main steps: first, low-level features are extracted, such as foreground, edge, texture, and gradient features; second, a regression model, such as linear regression, piecewise linear regression, ridge regression, or Gaussian-process regression, learns the mapping from those low-level features to the crowd count.
Deep learning (DL) is now widely used across research fields such as computer vision and natural language processing, and researchers have applied it to crowd counting because of its excellent feature-learning ability. A deep neural network is designed to extract crowd features from the image at multiple stages; the feature maps are fused to generate a crowd density map, which is summed to obtain the number of people in the image, thereby achieving the counting goal.
Disclosure of Invention
The invention addresses the following technical problem: when counting people in an enclosed space, an excessive depth of field causes large variations in apparent person size, and a conventional network cannot adapt to and recognize crowd characteristics at such varied scales, which harms counting accuracy.
Specifically, the method for constructing and training a crowd characteristic recognition network suitable for large depth of field comprises:
Network front end: a VGG-16 network with the fully connected layers removed, using 3×3 convolution kernels;
Upsampling layer: the front-end VGG-16 network applies max pooling three times, which reduces the resolution of the feature map;
Network back end: a three-branch network using dilated convolutions with dilation rates of 2 and 3, the dilated convolution being defined as:
y(m, n) = \sum_{i=1}^{M} \sum_{j=1}^{N} x(m + r \cdot i,\, n + r \cdot j)\, w(i, j)
where y(m, n) is the output of the dilated convolution of the input image x(m, n) with a convolution kernel w(i, j) of length M and width N; the parameter r is the dilation rate; when r = 1, the dilated convolution reduces to an ordinary convolution;
the training of the crowd characteristic network comprises the following steps:
1) Generating a density map
The density map is defined by convolving an impulse function with a Gaussian kernel. If the position of the i-th annotated head is x_i, an image with N annotated heads is represented as:
H(x) = \sum_{i=1}^{N} \delta(x - x_i)
Convolving this with a Gaussian function turns it into a continuous function;
a density map with a geometry-adaptive Gaussian kernel is used, given by:
F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \quad \sigma_i = \beta \bar{d}_i
For each annotated head position x_i, the mean distance \bar{d}_i to its k nearest neighbours is computed, so the pixels associated with x_i correspond to a region on the ground in the scene whose radius is proportional to \bar{d}_i;
to estimate the crowd density around x_i, H(x) is convolved with an adaptive Gaussian kernel whose variance \sigma_i is variable and proportional to \bar{d}_i;
2) Loss function
During training, the learning rate of stochastic gradient descent is fixed at 1e-6; the Euclidean distance measures the difference between the generated density map and the ground truth; the loss function is defined as follows:
L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \lVert Z(X_i; \Theta) - Z_i^{GT} \rVert_2^2
where N is the batch size, X_i the i-th input picture, Z(X_i; \Theta) the generated density map, and Z_i^{GT} the ground-truth density map;
3) Evaluation criteria
The mean squared error MSE and mean absolute error MAE are used; MSE describes the accuracy of the model, with a smaller MSE indicating higher accuracy, while MAE reflects the error of the predicted values;
MAE = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|
MSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( C_i - C_i^{GT} \right)^2 }
where N is the number of pictures in the test sequence, C_i the predicted count for picture X_i, and C_i^{GT} the actual count; the predicted count is the sum over the density map:
C_i = \sum_{l=1}^{L} \sum_{w=1}^{W} z_{l,w}
where z_{l,w} is the pixel value at position (l, w) of the estimated density map of length L and width W.
The front-end VGG-16 network uses a combination of 10 convolutional layers and 3 pooling layers.
This scheme adds a prior-information loss: during network training, the L2 distance serves as the loss function; because the traditional L2 distance alone overestimates the crowd in low-density areas and underestimates it in high-density areas, the loss function is computed blockwise, which greatly reduces the resulting errors and effectively improves counting accuracy.
Drawings
FIG. 1 is a schematic diagram of dilated convolution;
fig. 2 is a schematic diagram of a network structure and a training process.
Detailed Description
The technical scheme is explained below with reference to the accompanying drawings:
Referring to fig. 2, a crowd counting model based on a multi-scale perception deep neural network includes:
1. Network front end:
A VGG-16 network with the fully connected layers removed is used, with 3×3 convolution kernels. Studies have shown that, for the same receptive field, models with smaller convolution kernels and more convolutional layers outperform those with larger kernels and fewer layers. To balance accuracy against resource overhead, the VGG-16 network here uses a combination of 10 convolutional layers and 3 pooling layers.
2. Upsampling layer
The front end's VGG-16 network applies max pooling three times, which lowers the resolution of the resulting feature map; an upsampling step is therefore used to restore the feature-map resolution.
3. Network back end
The back-end network is a three-branch network using dilated convolutions with dilation rates of 2 and 3, the dilated convolution being defined as:
y(m, n) = \sum_{i=1}^{M} \sum_{j=1}^{N} x(m + r \cdot i,\, n + r \cdot j)\, w(i, j)
where y(m, n) is the output of the dilated convolution of the input image x(m, n) with a convolution kernel w(i, j) of length M and width N, and the parameter r is the dilation rate. When r = 1, the dilated convolution reduces to an ordinary convolution. Experiments show that dilated convolution uses a sparse kernel in place of alternating convolution and pooling operations, enlarging the receptive field without increasing the number of network parameters or the computational cost, which makes it better suited to the crowd-density estimation task. An ordinary convolution, by contrast, must stack more convolutional layers to obtain a larger receptive field, adding considerably more computation. A dilated convolution with dilation rate r expands a K×K kernel to an effective size of K + (K-1)(r-1). The 3×3 receptive fields in fig. 1 are thus enlarged to 5×5 and 7×7, respectively.
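A direct, minimal NumPy implementation of the dilated-convolution formula makes the definition concrete. This is a single-channel, valid-region sketch with 0-based indices; the function name and toy inputs are illustrative, not from the patent:

```python
import numpy as np

def dilated_conv2d(x, w, r=1):
    """y(m, n) = sum_i sum_j x(m + r*i, n + r*j) * w(i, j).

    Single-channel "valid" dilated convolution (cross-correlation form)
    with dilation rate r; r = 1 reduces to an ordinary convolution.
    """
    K = w.shape[0]
    eff = K + (K - 1) * (r - 1)                    # effective kernel size
    out_h, out_w = x.shape[0] - eff + 1, x.shape[1] - eff + 1
    y = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            for i in range(K):
                for j in range(K):
                    y[m, n] += x[m + r * i, n + r * j] * w[i, j]
    return y

# A 3x3 kernel covers an effective K + (K-1)(r-1) window:
# 3x3 at r = 1, 5x5 at r = 2, 7x7 at r = 3 -- matching fig. 1.
x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3))
print(dilated_conv2d(x, w, r=2))  # one 5x5 window sampled on a sparse grid
```

Note that at r = 2 the 3×3 kernel reads 9 pixels spread over a 5×5 window, so the receptive field grows with no extra weights, which is the whole point made above.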
4. Training method
4.1) Generating a density map
The density-map generation follows the method of MCNN (CVPR 2016). The density map is defined by convolving an impulse function with a Gaussian kernel. Assuming the position of the i-th annotated head is x_i, an image with N annotated heads can be represented as
H(x) = \sum_{i=1}^{N} \delta(x - x_i)
Convolving this with a Gaussian function turns it into a continuous function. Such a density function, however, assumes each x_i is independent in image space. In fact each x_i is a sample of the crowd density in the 3D scene, and, because of perspective distortion, the pixels associated with different samples x_i correspond to regions of different scales in the scene. Thus, to estimate the crowd density accurately, the perspective transformation must be taken into account. If the crowd density around a head is assumed to be roughly uniform, the head's nearest neighbours give a reasonable estimate of the geometric distortion. So that the density map better matches images with different viewing angles (different head sizes) and dense crowds, a density map with a geometry-adaptive Gaussian kernel is used, given by:
F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \quad \sigma_i = \beta \bar{d}_i
For each head position x_i, the mean distance \bar{d}_i to its k nearest neighbours is computed; the pixels associated with x_i then correspond to a region on the ground in the scene whose radius is proportional to \bar{d}_i. Thus, to estimate the crowd density around pixel x_i, H(x) is convolved with an adaptive Gaussian kernel whose variance \sigma_i is variable and proportional to \bar{d}_i.
4.2) Loss function
During training, the learning rate of stochastic gradient descent is fixed at 1e-6. The Euclidean distance measures the difference between the generated density map and the ground truth. The loss function is defined as follows:
L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \lVert Z(X_i; \Theta) - Z_i^{GT} \rVert_2^2
where N is the batch size, X_i the i-th input picture, Z(X_i; \Theta) the generated density map, and Z_i^{GT} the ground-truth density map.
Using the L2 distance alone as the loss function overestimates the crowd in low-density areas and underestimates it in high-density areas; the loss function is therefore computed blockwise, with losses calculated separately for high-density regions (crowd-dense regions identified by comparison with the data) and low-density regions to reduce the error.
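The patent does not spell out how the blockwise loss partitions the map. The NumPy sketch below is one plausible reading, with the block size and the high/low-density threshold as stated assumptions:

```python
import numpy as np

def blockwise_l2_loss(pred, gt, block=16, thresh=None):
    """Split the density maps into blocks and accumulate the squared L2
    error separately for high-density and low-density blocks.

    block and thresh are illustrative assumptions: the threshold defaults
    to the mean ground-truth density, standing in for the patent's
    "crowd-dense regions identified by comparison with the data".
    """
    if thresh is None:
        thresh = gt.mean()
    hi, lo = 0.0, 0.0
    for r in range(0, gt.shape[0], block):
        for c in range(0, gt.shape[1], block):
            g = gt[r:r + block, c:c + block]
            p = pred[r:r + block, c:c + block]
            err = np.sum((p - g) ** 2)
            if g.mean() > thresh:
                hi += err
            else:
                lo += err
    return hi, lo   # combine downstream, e.g. weighted per region

gt = np.zeros((64, 64))
gt[:16, :16] = 0.05                 # one dense corner in the ground truth
pred = gt + 0.01                    # uniformly biased estimate
hi, lo = blockwise_l2_loss(pred, gt)
print(hi > 0 and lo > 0)            # both regions contribute error
```

Keeping the two partial sums separate is what allows the dense and sparse regions to be weighted differently, counteracting the over/underestimation bias of a single global L2 term.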
4.3) Evaluation criteria
To allow comparison with the latest research when evaluating the crowd-density estimation model, the mean squared error (MSE) and mean absolute error (MAE) commonly adopted by researchers are used. MSE describes the accuracy of the model, with a smaller MSE indicating higher accuracy, while MAE reflects the error of the predicted values.
MAE = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|
MSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( C_i - C_i^{GT} \right)^2 }
where N is the number of pictures in the test sequence, C_i the predicted count for picture X_i, and C_i^{GT} the actual count. The predicted count is obtained by summing the density map:
C_i = \sum_{l=1}^{L} \sum_{w=1}^{W} z_{l,w}
where z_{l,w} is the pixel value at position (l, w) of the estimated density map of length L and width W.
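Summing the density map to obtain C_i, then computing MAE and MSE over a test sequence, can be written directly; the toy maps and counts below are illustrative:

```python
import numpy as np

def evaluate(pred_maps, gt_counts):
    """C_i = sum over all pixels z_{l,w}; then MAE and MSE as defined.

    Note that the patent's "MSE" takes a square root, i.e. it is an
    RMSE-style quantity in the same units as the count.
    """
    pred_counts = np.array([z.sum() for z in pred_maps])
    gt = np.asarray(gt_counts, dtype=float)
    mae = np.mean(np.abs(pred_counts - gt))
    mse = np.sqrt(np.mean((pred_counts - gt) ** 2))
    return pred_counts, mae, mse

# Two toy density maps: one sums to 4 people, one to 9.
maps = [np.full((2, 2), 1.0), np.full((3, 3), 1.0)]
counts, mae, mse = evaluate(maps, gt_counts=[5, 9])
print(counts, mae, mse)   # counts [4. 9.], MAE 0.5, MSE sqrt(0.5)
```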
In some scenarios, the crowd density varies over time. The scheme computes the loss function blockwise, calculating losses separately for high-density regions (crowd-dense regions identified by comparison with the data) and low-density regions to reduce the error.
Take people counting in an enclosed space as an example: because images captured by a camera in an enclosed space have a large depth of field, head sizes vary widely within a single image, so a multi-scale neural network is better suited to the recognition task.
This scheme adds a prior-information loss: during network training, the L2 distance serves as the loss function; because the traditional L2 distance alone overestimates the crowd in low-density areas and underestimates it in high-density areas, the loss function is computed blockwise, which greatly reduces the resulting errors and effectively improves counting accuracy.

Claims (2)

1. A method for constructing and training a crowd characteristic recognition network suitable for large depth of field, characterized in that
the crowd characteristic recognition network comprises:
the network front end: a VGG-16 network with the fully connected layers removed, using 3×3 convolution kernels;
an upsampling layer: the front-end VGG-16 network applies max pooling three times, which reduces the resolution of the feature map;
the network back end: a three-branch network using dilated convolutions with dilation rates of 2 and 3, the dilated convolution being defined as:
y(m, n) = \sum_{i=1}^{M} \sum_{j=1}^{N} x(m + r \cdot i,\, n + r \cdot j)\, w(i, j)
wherein: y(m, n) is the output of the dilated convolution of the input image x(m, n) with a convolution kernel w(i, j) of length M and width N; the parameter r is the dilation rate; when r = 1, the dilated convolution reduces to an ordinary convolution;
the training of the crowd characteristic network comprises the following steps:
1) generating a density map:
the density map is defined by convolving an impulse function with a Gaussian kernel;
assuming the position of the i-th annotated head is x_i, an image with N annotated heads is denoted H(x); if the crowd density around a head is assumed to be roughly uniform, the head's nearest neighbours give a reasonable estimate of the geometric distortion;
so that the density map better matches images with different viewing angles and dense crowds, a density map with a geometry-adaptive Gaussian kernel is used; for each head position x_i, the mean distance \bar{d}_i over several nearest neighbours is computed, and the pixels associated with x_i correspond to a region on the ground in the scene whose radius is proportional to \bar{d}_i; to estimate the crowd density around x_i, H(x) is convolved with an adaptive Gaussian kernel whose variance \sigma_i is variable and proportional to \bar{d}_i;
2) loss function
during training, the learning rate of stochastic gradient descent is fixed at 1e-6;
the Euclidean distance measures the difference between the generated density map and the ground truth; the loss function is computed blockwise, with losses calculated separately for the high-density and low-density regions to reduce the error;
3) evaluation criteria
when evaluating the crowd density estimation model, the mean squared error MSE and the mean absolute error MAE are used; MSE describes the accuracy of the model, with a smaller MSE indicating higher accuracy, while MAE reflects the error of the predicted values.
2. The method for constructing and training a crowd characteristic recognition network according to claim 1, characterized in that the front-end VGG-16 network uses a combination of 10 convolutional layers and 3 pooling layers.
CN202011484694.2A 2020-12-16 2020-12-16 Crowd characteristic recognition network construction and training method suitable for large depth of field Pending CN112633106A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011484694.2A CN112633106A (en) 2020-12-16 2020-12-16 Crowd characteristic recognition network construction and training method suitable for large depth of field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011484694.2A CN112633106A (en) 2020-12-16 2020-12-16 Crowd characteristic recognition network construction and training method suitable for large depth of field

Publications (1)

Publication Number Publication Date
CN112633106A true CN112633106A (en) 2021-04-09

Family

ID=75313421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011484694.2A Pending CN112633106A (en) 2020-12-16 2020-12-16 Crowd characteristic recognition network construction and training method suitable for large depth of field

Country Status (1)

Country Link
CN (1) CN112633106A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210603A (en) * 2019-06-10 2019-09-06 长沙理工大学 Counter model construction method, method of counting and the device of crowd
CN111563447A (en) * 2020-04-30 2020-08-21 南京邮电大学 Crowd density analysis and detection positioning method based on density map
CN111611878A (en) * 2020-04-30 2020-09-01 杭州电子科技大学 Method for crowd counting and future people flow prediction based on video image

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210603A (en) * 2019-06-10 2019-09-06 长沙理工大学 Counter model construction method, method of counting and the device of crowd
CN111563447A (en) * 2020-04-30 2020-08-21 南京邮电大学 Crowd density analysis and detection positioning method based on density map
CN111611878A (en) * 2020-04-30 2020-09-01 杭州电子科技大学 Method for crowd counting and future people flow prediction based on video image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI YUHONG 等: "CSRNet: dilated convolutional neural networks for understanding the highly congested scenes", 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 23 June 2018 (2018-06-23), pages 3 *
严芳芳 et al.: "Crowd counting algorithm using a multi-channel fusion grouped convolutional neural network" (in Chinese), Journal of Chinese Computer Systems (小型微型计算机系统), no. 10, 15 October 2020 (2020-10-15) *

Similar Documents

Publication Publication Date Title
CN108615027B (en) Method for counting video crowd based on long-term and short-term memory-weighted neural network
CN111563447B (en) Crowd density analysis and detection positioning method based on density map
TWI794414B (en) Systems and methods for real-time object detection using depth sensors
US10839543B2 (en) Systems and methods for depth estimation using convolutional spatial propagation networks
US11200424B2 (en) Space-time memory network for locating target object in video content
CN106096561B (en) Infrared pedestrian detection method based on image block deep learning features
CN107657226B (en) People number estimation method based on deep learning
Jeon et al. Road detection in spaceborne SAR images using a genetic algorithm
CN110942471B (en) Long-term target tracking method based on space-time constraint
US20040213460A1 (en) Method of human figure contour outlining in images
CN111340881B (en) Direct method visual positioning method based on semantic segmentation in dynamic scene
CN110765833A (en) Crowd density estimation method based on deep learning
CN107767416B (en) Method for identifying pedestrian orientation in low-resolution image
CN111783589B (en) Complex scene crowd counting method based on scene classification and multi-scale feature fusion
CN111191667A (en) Crowd counting method for generating confrontation network based on multiple scales
CN106157330B (en) Visual tracking method based on target joint appearance model
CN110879982A (en) Crowd counting system and method
CN112991269A (en) Identification and classification method for lung CT image
CN113822352B (en) Infrared dim target detection method based on multi-feature fusion
CN111860823B (en) Neural network training method, neural network image processing method, neural network training device, neural network image processing equipment and storage medium
CN112101195A (en) Crowd density estimation method and device, computer equipment and storage medium
CN116740439A (en) Crowd counting method based on trans-scale pyramid convertors
CN113408398A (en) Remote sensing image cloud detection method based on channel attention and probability up-sampling
CN106056078A (en) Crowd density estimation method based on multi-feature regression ensemble learning
CN110930384A (en) Crowd counting method, device, equipment and medium based on density information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination